One of the selling points of Google's flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is their ability to process and analyze enormous amounts of data. Google has claimed that these models can perform tasks that were previously impossible, such as summarizing documents hundreds of pages long or searching across scenes in film footage.
However, recent research suggests the models may not be as capable as Google claims. Two separate studies examined how well Google's Gemini models handle large datasets, such as lengthy works of fiction. Both found that Gemini 1.5 Pro and 1.5 Flash struggled to answer questions about that data, giving correct answers only 40%–50% of the time in one series of document-based tests.
“While models like Gemini 1.5 Pro can technically process long contexts, we have seen many cases indicating that the models don’t actually ‘understand’ the content,” said Marzena Karpinska, a postdoctoral researcher at UMass Amherst and co-author of one of the studies, in an interview with TechCrunch.
Gemini’s Context Window: A Closer Look
A model's context window refers to the amount of input data it can consider before generating an output. This could range from a simple question to a movie script or an audio clip. As context windows expand, they can accommodate larger documents. The latest versions of Gemini can handle up to 2 million tokens as context, which is equivalent to approximately 1.4 million words, two hours of video, or 22 hours of audio.
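As a back-of-the-envelope illustration, the figures above imply a ratio of roughly 0.7 English words per token (1.4 million words per 2 million tokens). The sketch below uses that illustrative heuristic — not Google's actual tokenizer — to estimate whether a document fits in the stated window:

```python
# Rough estimate of whether a document fits in a 2M-token context window.
# WORDS_PER_TOKEN is an assumption derived from the figures above
# (~1.4M words / 2M tokens), not an official specification.

CONTEXT_WINDOW_TOKENS = 2_000_000
WORDS_PER_TOKEN = 0.7

def estimated_tokens(word_count: int) -> int:
    """Estimate the number of tokens a document of `word_count` words needs."""
    return round(word_count / WORDS_PER_TOKEN)

# "War and Peace" runs to roughly 587,000 words:
tokens = estimated_tokens(587_000)
print(tokens)                           # ~839,000 tokens
print(tokens <= CONTEXT_WINDOW_TOKENS)  # True: fits with room to spare
```

By this estimate, even one of the longest novels in the canon occupies well under half of the advertised window — which is exactly why the studies below focus on whether the models can actually reason over such inputs, not merely ingest them.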
In a demonstration earlier this year, Google showcased Gemini 1.5 Pro's ability to search the transcript of the Apollo 11 moon landing telecast, which is around 402 pages long, for quotes containing jokes and then find a scene in the telecast similar to a pencil sketch. Google DeepMind's VP of research, Oriol Vinyals, described the model as “magical.”
Research Findings
However, the recent studies tell a different story. In one study, researchers from the Allen Institute for AI and Princeton asked Gemini models to evaluate true/false statements about recent fiction books, ensuring the models couldn't rely on prior knowledge. The results were underwhelming: Gemini 1.5 Pro answered correctly 46.7% of the time, while Flash managed only 20%.
The second study, conducted by researchers at UC Santa Barbara, tested how well Gemini 1.5 Flash can "reason" over videos. The researchers gave the model slideshow-style series of images paired with questions about the objects they depict. Flash's performance was again disappointing: it correctly transcribed six handwritten digits from a slideshow only about 50% of the time, dropping to around 30% with eight digits.
Overpromising and Under-Delivering
Despite these shortcomings, Google has heavily promoted the long-context capabilities of Gemini models. However, both studies indicate that these capabilities may be overstated. Google has not responded to these findings, but the research highlights the need for better benchmarks and third-party evaluations to verify claims about generative AI capabilities.
Generative AI technology, including models like Gemini, is under increasing scrutiny from businesses and investors due to its limitations. Surveys from Boston Consulting Group reveal that many executives are skeptical about the potential productivity gains from generative AI and concerned about the risks of mistakes and data breaches. PitchBook also reports a decline in generative AI deal-making at early stages.
“We haven’t settled on a way to really show that ‘reasoning’ or ‘understanding’ over long documents is taking place, and basically every group releasing these models is cobbling together their own ad hoc evaluations to make these claims,” Karpinska said. “Without the knowledge of how long context processing is implemented — and companies do not share these details — it is hard to say how realistic these claims are.”
Both Karpinska and Michael Saxon, a PhD student at UC Santa Barbara and co-author of the second study, advocate for improved benchmarks and greater emphasis on third-party critique to counter the hype around generative AI. Saxon notes that existing benchmarks, often cited by companies like Google, primarily measure a model's ability to retrieve specific information rather than answer complex questions about it.
“All scientists and most engineers using these models are essentially in agreement that our existing benchmark culture is broken,” Saxon said, “so it’s important that the public understands to take these giant reports containing numbers like ‘general intelligence across benchmarks’ with a massive grain of salt.”