Two recent papers from Anthropic attempt to shed light on the processes that take place within a large language model, exploring how to locate interpretable concepts and link them to the computational “circuits” that translate them into language, and how to characterize crucial behaviors of Claude Haiku 3.5, including hallucinations, planning, and other key traits.
The internal mechanisms behind large language models’ capabilities remain poorly understood, making it difficult to explain or interpret the strategies they use to solve problems. These strategies are embedded in the billions of computations that underpin each word the model generates, yet they remain largely opaque, according to Anthropic. To explore this hidden layer of reasoning, Anthropic researchers have developed a novel approach they call the “AI Microscope”:
We take inspiration from the field of neuroscience, which has long studied the messy insides of thinking organisms, and try to build a kind of AI microscope that will let us identify patterns of activity and flows of information.
In very simplified terms, Anthropic’s AI microscope involves replacing the model under study with a so-called replacement model, in which the model’s neurons are replaced by sparsely-active features that can often represent interpretable concepts. For example, one such feature may fire when the model is about to generate a state capital.
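To make the idea concrete, here is a minimal, hypothetical Python sketch of what replacing neurons with sparsely-active features can look like: a dense activation vector is encoded into a much wider feature space in which only a handful of features fire. The shapes, weights, and the replacement_layer function are invented for illustration and do not reflect Anthropic’s actual implementation.

```python
import numpy as np

# Illustrative only: re-express a dense activation vector as a much wider,
# sparsely-active set of features; each active feature is a candidate
# interpretable concept (e.g. "about to name a state capital").
rng = np.random.default_rng(0)

d_model, n_features = 64, 512                       # features vastly outnumber neurons
W_enc = rng.normal(size=(d_model, n_features)) * 0.05
W_dec = rng.normal(size=(n_features, d_model)) * 0.05
b_enc = np.full(n_features, -0.8)                   # negative bias encourages sparsity here
                                                    # (in practice sparsity comes from training)

def replacement_layer(activation):
    """Encode into sparse features, then decode back to an approximation."""
    feature_acts = np.maximum(activation @ W_enc + b_enc, 0.0)  # ReLU: most features stay at 0
    reconstruction = feature_acts @ W_dec
    return feature_acts, reconstruction

x = rng.normal(size=d_model)
acts, recon = replacement_layer(x)
print(f"{(acts > 0).sum()} of {n_features} features are active")
```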
Naturally, the replacement model will not always produce the same output as the underlying model. To address this limitation, Anthropic researchers use a local replacement model for each prompt they want to study, created by adding error terms and fixed attention patterns to the replacement model.
[A local replacement model] produces the exact same output as the original model, but replaces as much computation as possible with features.
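The role of the error terms can be shown with a self-contained toy (the functions below are stand-ins invented for this sketch, and the fixed-attention detail is omitted): because the per-prompt error term is added back, the local replacement model reproduces the original layer’s output exactly on that prompt, while still routing as much computation as possible through features.

```python
import numpy as np

# Toy illustration of the error-term trick; all functions are stand-ins.
rng = np.random.default_rng(1)
x = rng.normal(size=16)                  # residual-stream activation for one token of one prompt

def original_layer(h):                   # stand-in for the real model's layer
    return np.tanh(h)

def feature_approximation(h):            # stand-in for the sparse-feature reconstruction
    return np.maximum(h, 0.0) * 0.9      # imperfect on purpose

# The error term is computed once for this prompt and then held fixed.
error_term = original_layer(x) - feature_approximation(x)

def local_replacement_layer(h):
    return feature_approximation(h) + error_term

# On this prompt, the local replacement model matches the original exactly.
assert np.allclose(local_replacement_layer(x), original_layer(x))
```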
As a final step, to describe the flow of features through the local replacement model from the initial prompt to the final output, the researchers create an attribution graph, built by pruning away all features that do not affect the output.
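The pruning step can be pictured with a toy directed graph (the nodes and edges below are made up for illustration; the actual method also weighs how strongly each edge contributes to the output): any feature with no path to the output is dropped from the attribution graph.

```python
# Hypothetical attribution graph for a prompt like "The capital of Texas is".
edges = {
    "prompt":             ["feature_capital", "feature_texas", "feature_color"],
    "feature_capital":    ["feature_say_austin"],
    "feature_texas":      ["feature_say_austin"],
    "feature_color":      [],                    # fires, but never influences the output
    "feature_say_austin": ["output"],
    "output":             [],
}

def reaches_output(node, edges, seen=None):
    """Depth-first search: does any path from `node` lead to the output?"""
    seen = set() if seen is None else seen
    if node == "output":
        return True
    seen.add(node)
    return any(reaches_output(n, edges, seen) for n in edges[node] if n not in seen)

pruned = {node: succ for node, succ in edges.items() if reaches_output(node, edges)}
print(sorted(pruned))   # 'feature_color' has been pruned away
```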
Keep in mind that this is a very rough overview of Anthropic’s AI microscope. For full details, refer to the original paper linked above.
Using this approach, Anthropic researchers have arrived at a number of interesting results. Regarding multilingual capabilities, they found evidence of some kind of universal language that Claude uses to generate concepts before translating them into a specific language.
We investigate this by asking Claude for the “opposite of small” across different languages, and find that the same core features for the concepts of smallness and oppositeness activate, and trigger a concept of largeness, which gets translated out into the language of the question.
Another interesting finding goes against the general understanding that LLMs build their output word by word “without much forethought”. Instead, studying how Claude generates rhymes shows that it actually plans ahead.
Before starting the second line, it began “thinking” of potential on-topic words that would rhyme with “grab it”. Then, with these plans in mind, it writes a line to end with the planned word.
Anthropic researchers also dug into why models sometimes make up information, a.k.a. hallucinate. Hallucination is in some sense intrinsic to how models work, since they are always expected to produce a guess for the next word. This implies models must rely on specific anti-hallucination training to counter that tendency. In other words, there are two distinct mechanisms at play: one identifying “known entities” and another corresponding to “unknown name” or “can’t answer”. Their correct interplay is what guards models against hallucinating:
We show that such misfires can occur when Claude recognizes a name but doesn’t know anything else about that person. In cases like this, the “known entity” feature might still activate, and then suppress the default “don’t know” feature—in this case incorrectly. Once the model has decided that it needs to answer the question, it proceeds to confabulate: to generate a plausible—but unfortunately untrue—response.
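The interplay between the two circuits can be caricatured in a few lines of Python. The thresholds, scores, and decision logic below are invented for illustration and are not extracted from the model: the default is to refuse, recognizing a name suppresses the refusal, and when no facts back that recognition up, the result is a confabulated answer.

```python
def answer_or_refuse(name_recognition: float, factual_recall: float) -> str:
    """Toy model of the 'known entity' vs. 'can't answer' interplay (invented thresholds)."""
    known_entity = name_recognition > 0.5        # "I have seen this name before"
    if not known_entity:
        return "I don't know."                   # default refusal stays active
    if factual_recall > 0.5:
        return "Answer grounded in recalled facts."
    return "Plausible but untrue answer."        # refusal suppressed, nothing to recall: confabulation

print(answer_or_refuse(0.9, 0.8))   # well-known person with known facts
print(answer_or_refuse(0.9, 0.1))   # recognized name, no facts -> hallucination
print(answer_or_refuse(0.1, 0.0))   # unknown name -> default "don't know"
```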
Other interesting dimensions explored by Anthropic researchers concern mental math, producing a chain of thought that explains the reasoning leading to an answer, multi-step reasoning, and jailbreaks. You can find the full details in Anthropic’s papers.
Anthropic’s AI microscope aims to contribute to interpretability research and to eventually provide a tool that helps us understand how models produce their inferences and ensure they are aligned with human values. Yet it is still an incipient effort that captures only a tiny fraction of the total model computation and can only be applied to small prompts of tens of words. InfoQ will continue to report on advancements in LLM interpretability as new insights emerge.