Gemini 1.5 Pro, the newest foundation model in Google's Gemini series, now offers a 1-million-token context window, the largest of any large-scale foundation model to date. Anthropic's Claude 2.1 previously held the record at 200,000 tokens. A large context window allows a model to process and understand extremely long documents, books, scripts or codebases that would otherwise have to be split into smaller chunks.
To complement its larger context window, Gemini 1.5 Pro also posts near-perfect recall of more than 99% in needle-in-a-haystack retrieval tests on contexts of up to 10 million tokens, along with improved next-token prediction. Lower retrieval rates over a smaller token range mean more errors and less useful information, so these improvements stand to increase the model's accuracy and utility.
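To show how such a retrieval figure is typically measured, here is a minimal sketch of a needle-in-a-haystack test: a single fact is hidden inside a long block of filler text and the model is asked to recall it. The filler sentence, the "needle" and the ask_model call are illustrative placeholders, not Google's actual benchmark setup.

```python
# A minimal needle-in-a-haystack retrieval test (illustrative only).
import random

def build_haystack(num_sentences: int, needle: str) -> str:
    """Build a long filler passage with one 'needle' fact hidden at a random position."""
    filler = "The quick brown fox jumps over the lazy dog."
    sentences = [filler] * num_sentences
    sentences.insert(random.randrange(num_sentences), needle)
    return " ".join(sentences)

needle = "The secret access code is 7421."
prompt = (
    build_haystack(num_sentences=5000, needle=needle)
    + "\n\nWhat is the secret access code mentioned in the text above?"
)

# answer = ask_model(prompt)   # hypothetical call to any long-context LLM
# passed = "7421" in answer    # each trial is scored pass/fail; recall is the pass rate
```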
Gemini 1.5 Pro is differentiated by a mixture-of-experts architecture, which improves efficiency by routing each input to a small set of specialized expert sub-networks rather than activating the entire model for every token. Google trained the model on 4,096-chip pods of its TPUv4 accelerators using multilingual data along with web documents, code and multimodal content including audio and video.
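To make the routing idea concrete, here is a minimal sketch of a top-1 mixture-of-experts layer in Python with NumPy. The layer sizes, random weights and single-expert routing are illustrative assumptions on my part; Google has not published the details of Gemini 1.5 Pro's architecture.

```python
# A toy mixture-of-experts layer: a router scores each token, and only the
# winning expert's small MLP runs for that token.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, num_experts = 16, 32, 4

# Each "expert" is a tiny two-layer MLP with random placeholder weights.
experts = [
    (rng.normal(size=(d_model, d_hidden)), rng.normal(size=(d_hidden, d_model)))
    for _ in range(num_experts)
]
router = rng.normal(size=(d_model, num_experts))  # scores tokens against experts

def moe_layer(tokens: np.ndarray) -> np.ndarray:
    """Route each token to its highest-scoring expert and apply only that expert."""
    scores = tokens @ router              # (num_tokens, num_experts)
    chosen = scores.argmax(axis=-1)       # top-1 expert per token
    out = np.empty_like(tokens)
    for e, (w_in, w_out) in enumerate(experts):
        mask = chosen == e
        if mask.any():
            hidden = np.maximum(tokens[mask] @ w_in, 0.0)  # ReLU MLP
            out[mask] = hidden @ w_out
    return out

print(moe_layer(rng.normal(size=(8, d_model))).shape)  # (8, 16)
```

Because only one expert's weights run for each token, the layer's total parameter count can grow without a proportional increase in compute per token, which is the main appeal of the approach.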
Although there are many advantages to large context windows, research by Anthropic suggests that expanding context size may also defeat safety guardrails. More details on that below.
Multimodal Long Context
Input modalities for Gemini 1.5 Pro now include audio understanding in the Gemini API and Google AI Studio, which can extract and interpret spoken language from large audio and video files. As an example of what this enables, audio understanding could turn a 100,000-token videotaped college lecture into a quiz with an answer key. It could also be used to process a 200,000-token narrated video walkthrough of a large warehouse and locate any visible storage item. Long context and audio understanding open many new use cases for Gemini 1.5 Pro.
The new model’s 1-million-token context window allows users to upload large PDFs, code repositories and lengthy videos as prompts. Developers can upload multiple large files and then ask questions about the intersections of multimodal content, such as in which video frame a particular piece of dialogue occurred.
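As a rough illustration of that workflow, the sketch below uses the google-generativeai Python SDK to upload a lecture recording and its slides and then ask for a quiz keyed to timestamps, along the lines of the lecture example above. The file names and model identifier are placeholders, and the assumption that a PDF can be passed through the File API alongside video is mine; check the current Gemini API documentation for the exact surface.

```python
# A hedged sketch of a long-context multimodal prompt with the
# google-generativeai SDK; names and model IDs may differ in current docs.
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload large files once; the File API stores them for reuse in prompts.
lecture_video = genai.upload_file(path="lecture_recording.mp4")
lecture_slides = genai.upload_file(path="lecture_slides.pdf")

# Video uploads are processed asynchronously, so poll until the file is ready.
while lecture_video.state.name == "PROCESSING":
    time.sleep(5)
    lecture_video = genai.get_file(lecture_video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    lecture_video,
    lecture_slides,
    "Write a ten-question quiz with an answer key covering this lecture, "
    "and note the video timestamp where each answer is discussed.",
])
print(response.text)
```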
The above graphic shows the multimodal prompt used to test Gemini 1.5 Pro on its ability to extract content from Sherlock Jr., a 45-minute Buster Keaton movie from 1924. Sampled at 1 frame per second, the film comprises 2,674 frames, which amounts to 684,000 tokens.
Note that one of the prompts is text only, while the other combines a hand-drawn image with text. In both cases, the model located the relevant information along with its exact frame and timestamp.
New Coding Advantages
The extended context window also gives Gemini 1.5 Pro a coding advantage by allowing it to ingest an entire codebase, which developers can upload directly or through Google Drive. With access to the whole codebase, the model can analyze relationships and patterns across files for a better understanding of the code.
As an example, with its extended context window Gemini 1.5 Pro can accommodate codebases such as JAX, Google's machine learning framework for high-performance numerical computing and automatic differentiation, which contains 746,152 tokens.
After ingesting JAX, Gemini 1.5 Pro was able to identify the specific location of a core automatic differentiation method: the backward pass, which computes the gradients that determine how a model's parameters should change during training.
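A simple way to hand an entire repository to a long-context model is to concatenate its source files, with path headers, into a single prompt. The sketch below assumes the `model` client from the earlier snippet and a local checkout of JAX; the packing helper and the question wording are illustrative, not the exact prompt Google used.

```python
# Pack a local repository into one long-context prompt (illustrative sketch).
from pathlib import Path

def pack_repository(root: str, extensions=(".py",)) -> str:
    """Concatenate source files with path headers so the model can cite locations."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in extensions:
            parts.append(f"\n### FILE: {path}\n{path.read_text(errors='ignore')}")
    return "".join(parts)

codebase = pack_repository("jax/")  # the full JAX repo is roughly 746K tokens
response = model.generate_content([
    codebase,
    "Where is the core automatic differentiation backward pass implemented, "
    "and which functions would I modify to change its behavior?",
])
print(response.text)
```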
Long Context Red Flags
Expanding context-window size has been an essential part of AI model development. Since 2023, the context window has gone from a few thousand tokens to Gemini 1.5 Pro’s current record of 1 million tokens.
Anthropic has been a leader in expanding context-window size. To highlight one of the potential downsides of longer context windows, it recently published research explaining how a long context window can be used to exploit an LLM through a method called many-shot jailbreaking (MSJ). Techniques such as MSJ cause large language models to ignore their safety guardrails, freeing the model to engage in bad behaviors such as issuing insults or providing instructions for building weapons, picking locks or performing other forbidden tasks.
Implementing many-shot jailbreaking is relatively simple. As shown in the above graphic, a large language model can be induced to ignore its safety guardrails if a user packs a single prompt with hundreds of faux dialogue turns in which harmful questions are posed and answered. MSJ doesn't work with five shots, but it works consistently at around 256 shots.
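For illustration, here is a structural sketch of how an MSJ prompt scales with shot count, following Anthropic's published description. The question and answer strings are placeholders only; the point is the format of the faux dialogue turns, not their content.

```python
# Structural sketch of a many-shot prompt; all content strings are placeholders.
def build_msj_prompt(num_shots: int, target_question: str) -> str:
    """Assemble faux Human/Assistant turns followed by the real target question."""
    turns = []
    for i in range(num_shots):
        turns.append(f"Human: <disallowed question {i}>")
        turns.append(f"Assistant: <compliant answer {i}>")
    turns.append(f"Human: {target_question}")
    turns.append("Assistant:")
    return "\n".join(turns)

# Anthropic reports that a handful of shots has little effect, while
# hundreds of shots (e.g., 256) reliably override guardrails.
weak_attempt = build_msj_prompt(5, "<target question>")
strong_attempt = build_msj_prompt(256, "<target question>")
```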
The researchers determined that MSJ works against Claude 2.0, GPT-3.5-turbo, GPT-4-1106-preview, Llama 2 (70B) and Mistral 7B. In fact, prompts of around 128 shots were enough to cause those models to exhibit bad behavior. Anthropic disclosed this research so the AI community could help develop methods to mitigate MSJ.
Conclusion
Gemini 1.5 Pro is available in Vertex AI Model Garden, part of Google's Vertex AI platform for data scientists and engineers that is designed to simplify building and deploying AI models. Model Garden offers more than 80 models, including both Google's proprietary models and open-source models such as Stable Diffusion, BERT and T5.
I'm looking forward to seeing what developers can produce using Gemini 1.5 Pro's 1-million-token context window and its ability to work with images, video, audio and text at the same time. Gemini 1.5 Pro is currently the only model with that combination of capabilities, which gives Google a large competitive advantage, at least until the competition catches up.