Anthropic researchers have open-sourced the tool they used to trace what goes on inside a large language model during inference. It includes a circuit tracing Python library that can be used with any open-weights model and a frontend hosted on Neuronpedia to explore the library's output through a graph.
As InfoQ reported at the time of Anthropic’s original disclosure, their approach to shedding light on an LLM’s internal behavior involves replacing the actual model with another one that uses sparsely-active features from cross-layer MLP transcoders instead of the original neurons. These features can often represent interpretable concepts, making it possible to build an attribution graph by pruning away all features that do not influence the output under investigation.
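To make the idea of sparsely-active transcoder features more concrete, here is a minimal, conceptual PyTorch sketch of such a module. The class name, dimensions, and top-k sparsity rule are illustrative assumptions, not Anthropic's actual architecture or training recipe, and a cross-layer transcoder additionally lets features write to multiple downstream layers, which is omitted here.

```python
import torch
import torch.nn as nn

class TranscoderSketch(nn.Module):
    """Conceptual stand-in for an MLP transcoder: it maps an MLP's input
    activations to that MLP's output through a wide, sparsely-active feature
    layer whose individual features are often human-interpretable."""

    def __init__(self, d_model: int = 2304, n_features: int = 16384, k: int = 32):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)
        self.k = k  # only k features remain active per token

    def forward(self, mlp_input: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        pre_acts = torch.relu(self.encoder(mlp_input))
        # Keep only the top-k features per position and zero out the rest,
        # which is what makes the features "sparsely active".
        topk = torch.topk(pre_acts, self.k, dim=-1)
        features = torch.zeros_like(pre_acts).scatter_(-1, topk.indices, topk.values)
        mlp_output_approx = self.decoder(features)
        return features, mlp_output_approx


# Example: approximate an MLP's computation with the transcoder's reconstruction.
transcoder = TranscoderSketch()
x = torch.randn(1, 8, 2304)           # (batch, sequence, d_model) activations
features, approx_out = transcoder(x)  # sparse features + reconstructed MLP output
```

In the replacement model, reconstructions like approx_out stand in for the original MLP outputs, so the model's behavior can be described in terms of the sparse features rather than individual neurons.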
Anthropic’s circuit tracer library can identify replacement circuits and generate attribution graphs from a given model using pre-trained transcoders.
It computes the direct effect that each non-zero transcoder feature, transcoder error node, and input token has on each other non-zero transcoder feature and output logit [Editor’s note: the raw (non-normalized) score a model assigns to each possible output before applying a probability function like softmax].
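Conceptually, each such edge weight measures how far a source node's write into the residual stream moves a target node, holding everything else fixed. The sketch below illustrates that activation-times-gradient idea on toy tensors; the names and dimensions are made up, and it is a simplification of what the library actually computes.

```python
import torch

def direct_effect(source_activation: torch.Tensor,
                  source_decoder_vec: torch.Tensor,
                  grad_target_wrt_residual: torch.Tensor) -> torch.Tensor:
    """Toy edge weight: the source feature's write into the residual stream
    (activation times its decoder direction) dotted with the gradient of the
    target node with respect to that residual position."""
    contribution = source_activation * source_decoder_vec
    return contribution @ grad_target_wrt_residual


# Toy setup: one feature writing into a 16-dimensional residual stream and a
# "target logit" that reads the residual stream through a fixed linear map.
d_model = 16
decoder_vec = torch.randn(d_model)
activation = torch.tensor(1.7)        # the source feature fired with this value
readout = torch.randn(d_model)

residual = torch.zeros(d_model, requires_grad=True)
target_logit = residual @ readout
target_logit.backward()

print(direct_effect(activation, decoder_vec, residual.grad))
```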
As one of Anthropic’s researchers noted on Hacker News, the graph reveals intermediate computational steps the model took to sample a token, which can provide useful insights. These insights can then be used, for example, to manipulate transcoder features and observe how the model’s output changes.
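As a conceptual illustration of such an intervention, the sketch below rescales a single transcoder feature before it is decoded back into the residual stream and measures the resulting shift. The names, indices, and dimensions are illustrative assumptions; in a real experiment the edited features would be fed back through the model to see how the output logits change.

```python
import torch
import torch.nn as nn

def ablate_feature(features: torch.Tensor, feature_idx: int, scale: float = 0.0) -> torch.Tensor:
    """Rescale one feature's activation (scale=0.0 removes it entirely)
    before it is decoded back into the residual stream."""
    edited = features.clone()
    edited[..., feature_idx] *= scale
    return edited


# Illustrative dimensions only: a sparse feature vector for one token and a
# decoder that writes the features into the residual stream.
n_features, d_model = 16384, 2304
decoder = nn.Linear(n_features, d_model, bias=False)

features = torch.zeros(1, n_features)
features[0, 123] = 3.2    # pretend feature 123 fired on this token
features[0, 4567] = 1.1

baseline = decoder(features)
edited = decoder(ablate_feature(features, feature_idx=123))
print("shift in residual stream:", (baseline - edited).norm().item())
```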
Anthropic has already used its circuit tracer to study multi-step reasoning and multilingual representations in Gemma-2-2b and Llama-3.2-1b. Below is an example of the attribution graph generated for the prompt “Fact: The capital of the state containing Dallas is”.
In a lengthy podcast hosted by Dwarkesh Patel featuring Anthropic’s Trenton Bricken and Sholto Douglas, Bricken explained how Anthropic’s research into circuit tracing is a key contribution to LLM mechanistic interpretability, that is, the effort to understand what the core units of computation are inside an LLM. This builds on previous research using toy models, then sparse autoencoders, and eventually circuits.
Now you’re identifying individual features across the layers of the model that are all working together to perform some complicated task. And you can get a much better idea of how it’s actually doing the reasoning and coming to decisions.
This is still a very young field, but one that is becoming increasingly critical for the safe use of LLMs:
Depending on how quickly AI accelerates and where the state of our tools are, we might not be in the place where we can prove from the ground up that everything is safe. But I feel like that’s a very good North Star. It’s a very powerful reassuring North Star for us to aim for, especially when we consider we are part of the broader AI safety portfolio.
The circuit tracing library can be easily run from Anthropic’s tutorial notebook. Alternatively, you can use it on Neuronpedia or install it locally.