Gemini 1.5: A Giant Leap in Long-Context AI

Google DeepMind has unveiled Gemini 1.5 Pro, its latest AI system, which represents a major advance in models' ability to understand and reason over extremely long contexts across modalities such as text, images, audio and video.

Gemini 1.5 Pro compared to Gemini 1.0 family. Gemini 1.5 Pro maintains high levels of performance even as its context window increases.

The core innovation of Gemini 1.5 is its dramatically expanded context length, which lets it incorporate up to 10 million tokens of context. That is roughly a 50x increase over the 200k-token window of previous state-of-the-art models such as Claude 2.1. It allows Gemini 1.5 to process huge amounts of real-world data such as lengthy documents, books, codebases and hours of video and audio.

In extensive evaluations, Gemini 1.5 demonstrated near-perfect recall on “needle in a haystack” benchmarks, reliably retrieving information from contexts of up to 10 million tokens, equivalent to about 10,000 pages of text. It also excelled at question answering over the full 700,000+ token text of Les Miserables, and it was able to pick up new skills, such as translating English into a low-resource language, using only reference materials provided in-context.

Text Haystack. This figure compares Gemini 1.5 Pro with GPT-4 Turbo on the text needle-in-a-haystack task. Green cells indicate that the model successfully retrieved the secret number, gray cells indicate API errors, and red cells indicate that the model response did not contain the secret number. The top row shows results for Gemini 1.5 Pro, from 1k to 1M tokens (top left) and from 1M to 10M tokens (top right). The bottom row shows results for GPT-4 Turbo up to its maximum supported context length of 128k tokens.
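
To make the needle-in-a-haystack setup concrete, here is a minimal Python sketch of how such a test could be built and scored. It is an illustration under stated assumptions, not DeepMind's actual evaluation harness: the filler text, the needle wording, the depth parameter and the stub_model function are all hypothetical, and a real run would replace stub_model with a call to the model under test.

```python
import random
import re

# Hypothetical filler text; a real benchmark would use long natural documents.
FILLER = "The grass is green. The sky is blue. The sun is yellow. "


def build_haystack(n_words: int, depth: float, secret: int) -> str:
    """Return n_words of filler with a needle inserted at a relative depth.

    depth is 0.0 for the very start of the context and 1.0 for the very end.
    """
    filler_words = FILLER.split()
    words = [filler_words[i % len(filler_words)] for i in range(n_words)]
    needle = f"The secret number is {secret}."
    position = int(depth * len(words))
    return " ".join(words[:position] + [needle] + words[position:])


def is_retrieved(response: str, secret: int) -> bool:
    """Score one cell: success means the secret number appears in the response."""
    return re.search(rf"\b{secret}\b", response) is not None


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (an assumption); it just searches the prompt."""
    match = re.search(r"secret number is (\d+)", prompt)
    return match.group(1) if match else "I don't know."


def run_grid(lengths, depths, model=stub_model, seed=0):
    """Evaluate every (context length, needle depth) cell and record pass/fail."""
    rng = random.Random(seed)
    results = {}
    for n_words in lengths:
        for depth in depths:
            secret = rng.randint(10000, 99999)
            prompt = build_haystack(n_words, depth, secret)
            prompt += "\nWhat is the secret number?"
            results[(n_words, depth)] = is_retrieved(model(prompt), secret)
    return results


if __name__ == "__main__":
    for cell, ok in run_grid([100, 1000, 10000], [0.0, 0.5, 1.0]).items():
        print(cell, "green" if ok else "red")
```

Each entry in the returned dictionary corresponds to one cell of a grid like the one in the figure: True maps to a green cell and False to a red one. The sketch measures context length in words rather than tokens and does not model the gray API-error cells.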

Remarkably, Gemini 1.5 achieves this long-context prowess while still matching or exceeding the performance of Google’s previous best model, Gemini 1.0 Ultra, across a broad range of core capabilities like mathematical reasoning, code generation, and multilinguality. And it does so while requiring significantly less training compute and deployment resources.

The dramatically expanded context window unlocks practical applications that were not previously possible. For example, Gemini 1.5 could enable software engineers to query entire codebases in natural language, or allow journalists to explore archives of text, video and audio content in depth. Its in-context learning also raises the exciting possibility of rapidly acquiring skills, such as translating new languages, with minimal external data.
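
As a rough illustration of the codebase use case, the following Python sketch packs the source files of a project into a single long prompt while tracking an approximate token budget. Everything here is an assumption made for illustration: the function names, the four-characters-per-token estimate and the 1,000,000-token default budget are hypothetical, and the step that actually sends the prompt to a model is left out because it depends on the API being used.

```python
import os

# Rough heuristic (an assumption): real tokenizers vary by model and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate the token count of text from its character length."""
    return len(text) // CHARS_PER_TOKEN + 1


def pack_codebase(root: str, extensions=(".py",), budget_tokens: int = 1_000_000):
    """Concatenate source files under root into one context string.

    Files that would push the estimated size past budget_tokens are skipped
    and reported, so the caller knows the context is incomplete.
    """
    parts, used, skipped = [], 0, []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            if not name.endswith(tuple(extensions)):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                body = fh.read()
            # Prefix each file with its relative path so answers can cite it.
            chunk = f"--- {os.path.relpath(path, root)} ---\n{body}\n"
            cost = estimate_tokens(chunk)
            if used + cost > budget_tokens:
                skipped.append(path)
                continue
            parts.append(chunk)
            used += cost
    return "".join(parts), used, skipped


def build_prompt(context: str, question: str) -> str:
    """Combine the packed codebase with a natural-language question."""
    return f"{context}\nQuestion about the code above: {question}\n"


if __name__ == "__main__":
    context, used, skipped = pack_codebase(".")
    print(f"packed about {used} tokens, skipped {len(skipped)} files")
    prompt = build_prompt(context, "Where is the configuration loaded?")
```

With a window of a few thousand tokens, most of a project would land in the skipped list and some retrieval step would be needed to choose which files to include. A window in the millions of tokens is what makes the "send everything" approach plausible for many repositories.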

However, such powerful long-context reasoning does have risks if deployed irresponsibly. DeepMind appears to have invested heavily in safety practices like impact assessment, data filtering, and model tuning to mitigate potential harms. But there are still open questions around how to properly evaluate and control such capable AI systems.

By pushing the boundaries of how much context AI can incorporate and reason over, Gemini 1.5 represents a key milestone in developing more general and capable machine intelligence. While work remains to build appropriate safety measures, its long-context breakthroughs point the way towards AI that can understand and act within the complexities of the real world.
