Gemma Scope 2 is a suite of tools designed to interpret the behavior of Gemma 3 models, enabling researchers to analyze emergent model behaviors, audit and debug AI agents, and devise mitigation strategies against security issues like jailbreaks, hallucinations, and sycophancy.
Interpretability research aims to understand the internal workings and learned algorithms of AI models. As AI becomes increasingly capable and complex, interpretability is crucial for building AI that is safe and reliable.
Google describes Gemma Scope as a microscope for its LLMs. It combines sparse autoencoders (SAEs) and transcoders to let researchers inspect a model’s internal representations, examine what it “thinks,” and understand how those internal states shape its behavior. One key use case is inspecting discrepancies between a model’s output and its internal state, which Google says could help surface safety risks.
Gemma Scope 2 extends the original Gemma Scope, which targeted the Gemma 2 family, in several ways. Most notably, Google retrained the SAEs and transcoders across every layer of the Gemma 3 models, including skip-transcoders and cross-layer transcoders, which are designed to make multi-step computations and distributed algorithms easier to interpret.
Increasing the number of layers, Google explains, directly increases compute and memory requirements, which required designing specialized sparse kernels to keep complexity scaling linearly with the number of layers.
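Google does not detail those kernels, but the core idea is easy to illustrate: when only a handful of features fire for a given token, the decoder only needs to touch the rows belonging to those features. The NumPy toy below, with made-up dimensions, sketches that sparse-decode trick; it is an illustration of the principle, not Google’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, k = 1024, 8192, 32   # made-up sizes, not Gemma Scope 2's

W_dec = rng.standard_normal((n_features, d_model)).astype(np.float32)

# Sparse feature vector: only k of the n_features entries are non-zero.
acts = np.zeros(n_features, dtype=np.float32)
active_idx = rng.choice(n_features, size=k, replace=False)
acts[active_idx] = rng.random(k).astype(np.float32)

# Dense decode: work proportional to n_features * d_model per token.
dense_out = acts @ W_dec

# Sparse decode: work proportional to k * d_model per token, since only the
# decoder rows of the active features are gathered and summed.
sparse_out = acts[active_idx] @ W_dec[active_idx]

assert np.allclose(dense_out, sparse_out, atol=1e-3)
```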
In addition, Google applied a more advanced training technique to improve Gemma Scope 2’s ability to identify more useful concepts, while also addressing several known flaws in the first implementation. Finally, Gemma Scope 2 introduces tools specifically tailored for chatbot analysis, enabling the study of complex, multi-step behaviors, such as jailbreaks, refusal mechanisms, and chain-of-thought faithfulness.
Sparse autoencoders use a pair of encoder and decoder functions to decompose and reconstruct a model’s internal activations. Transcoders, on the other hand, are trained to sparsely reconstruct the computation of a multi-layer perceptron (MLP) sublayer, that is, to learn how to approximate its output for a given input. This makes them useful for identifying which parts of each layer and sublayer, or more precisely which patterns of activations, are triggered by individual input tokens or sequences of tokens.
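To make the distinction concrete, here is a minimal NumPy sketch of both forward passes. The shapes, random initialization, and plain ReLU are simplifying assumptions (the original Gemma Scope SAEs use a JumpReLU activation), so treat it as an illustration of the two training targets rather than the released architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 256, 4096   # illustrative sizes only

def init_params():
    # Each probe (SAE or transcoder) has its own trained encoder/decoder;
    # random weights here just make the sketch runnable.
    return dict(
        W_enc=rng.standard_normal((d_model, n_features)) * 0.02,
        b_enc=np.zeros(n_features),
        W_dec=rng.standard_normal((n_features, d_model)) * 0.02,
        b_dec=np.zeros(d_model),
    )

def encode(p, x):
    # Plain ReLU for simplicity; a trained encoder yields a far sparser code.
    return np.maximum(x @ p["W_enc"] + p["b_enc"], 0.0)

def sae_forward(p, activation):
    # An SAE is trained so that the decoded features reconstruct the activation itself.
    features = encode(p, activation)
    return features @ p["W_dec"] + p["b_dec"], features

def transcoder_forward(p, mlp_input):
    # A transcoder runs the same computation, but its training target is the MLP
    # sublayer's *output*, i.e. it learns to approximate mlp(mlp_input).
    features = encode(p, mlp_input)
    return features @ p["W_dec"] + p["b_dec"], features

x = rng.standard_normal(d_model)
recon, feats = sae_forward(init_params(), x)
print("active features:", int((feats > 0).sum()))
```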
Besides the application to security issues, redditor Mescalian foresees that this research could:
also help inform best practices in other domains, and in the future this technique probably will be used to monitor more intelligent AIs' internal reasoning. Right now though it’s most useful for steering capabilities through fine-tuning and other modification of weights.
Like Google, Anthropic and OpenAI have also released their own “AI microscopes” tailored to their respective models.
Google has released the weights of Gemma Scope 2 on Hugging Face.
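For readers who want to load them, the files can be fetched with the standard huggingface_hub client; the repository id and file path below are placeholders, to be replaced with the actual entries from the Gemma Scope 2 collection on Hugging Face.

```python
from huggingface_hub import hf_hub_download
import numpy as np

# Placeholders: look up the actual repository id and file layout in the
# Gemma Scope 2 collection on Hugging Face before running this.
REPO_ID = "google/gemma-scope-2-placeholder"   # placeholder, not a real repo id
FILENAME = "layer_12/params.npz"               # placeholder file path

path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
params = np.load(path)
print({name: tensor.shape for name, tensor in params.items()})
```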
