Anthropic’s “AI Microscope” Explores the Inner Workings of Large Language Models

News Room
Published 12 April 2025, last updated 2:34 PM

Two recent papers from Anthropic attempt to shed light on the processes that take place inside a large language model. They explore how to locate interpretable concepts and link them to the computational “circuits” that translate them into language, and how to characterize crucial behaviors of Claude 3.5 Haiku, including hallucinations, planning, and other key traits.

The internal mechanisms behind large language models’ capabilities remain poorly understood, making it difficult to explain or interpret the strategies they use to solve problems. These strategies are embedded in the billions of computations that underpin each word the model generates, yet they remain largely opaque, according to Anthropic. To explore this hidden layer of reasoning, Anthropic researchers have developed a novel approach they call the “AI Microscope”:

We take inspiration from the field of neuroscience, which has long studied the messy insides of thinking organisms, and try to build a kind of AI microscope that will let us identify patterns of activity and flows of information.

In very simplified terms, Anthropic’s AI microscope involves replacing the model under study with a so-called replacement model, in which the model’s neurons are replaced by sparsely active features that can often represent interpretable concepts. For example, one such feature may fire when the model is about to generate a state capital.
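As a rough mental model only (Anthropic’s actual architecture is more sophisticated), a replacement layer can be pictured as re-expressing dense activations through a large, sparsely activating feature dictionary. The following toy numpy sketch uses invented dimensions and random weights purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 64, 512          # model width vs. a larger feature dictionary
W_enc = rng.normal(0, 0.1, (n_features, d_model))
b_enc = np.full(n_features, -1.5)      # negative bias keeps most features silent
W_dec = rng.normal(0, 0.1, (d_model, n_features))

def replacement_layer(h):
    """Stand-in for one layer: re-express activations as sparse features."""
    f = np.maximum(W_enc @ h + b_enc, 0.0)   # ReLU: most features stay exactly 0
    return W_dec @ f, f                       # reconstruction + feature activations

h = rng.normal(size=d_model)                  # some toy residual-stream activation
h_hat, features = replacement_layer(h)
active = np.flatnonzero(features)
print(f"{len(active)}/{n_features} features active")  # only a few percent fire
```

In a trained replacement model, each of those few active features would ideally correspond to a human-interpretable concept, such as “about to name a state capital”.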

Naturally, the replacement model won’t always produce the same output as the underlying model. To address this limitation, Anthropic researchers use a local replacement model for each prompt they want to study, created by adding error terms and fixed attention patterns to the replacement model.

[A local replacement model] produces the exact same output as the original model, but replaces as much computation as possible with features.
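To make the “exact same output” property concrete: the error term can be pictured as the frozen, prompt-specific difference between the original computation and the feature reconstruction, added back so the two agree by construction. Continuing the toy sketch above (original_mlp is a stand-in, not a real model component):

```python
def original_mlp(h):
    # Stand-in for the underlying model's computation at this layer.
    return np.tanh(h) * 1.5

def local_replacement(h):
    out_hat, features = replacement_layer(h)
    err = original_mlp(h) - out_hat       # frozen, prompt-specific error term
    return out_hat + err, features        # output now matches the original exactly

out, _ = local_replacement(h)
assert np.allclose(out, original_mlp(h))  # exact agreement by construction
```

The interpretable work is carried by the features; the error term simply absorbs whatever the features fail to explain for this particular prompt.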

As a final step, to describe the flow of features through the local replacement model from the initial prompt to the final output, the researchers create an attribution graph. The graph is built by pruning away all features that do not affect the output.
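The pruning step can be pictured as keeping only features with a non-negligible path of influence to the output. A toy sketch, with invented feature names and edge weights:

```python
# Toy attribution graph: nodes are features (plus the output), edges carry
# direct-effect weights. Prune any feature whose strongest path of influence
# to the output falls below a threshold. All names and weights are made up.
edges = {
    "say_Austin":      [("OUTPUT", 0.9)],
    "Texas":           [("say_Austin", 0.7)],
    "state_capital":   [("say_Austin", 0.6)],
    "unrelated_sport": [("Texas", 0.001)],   # negligible influence
}

def influence(node):
    """Max product of edge weights along any path from node to OUTPUT."""
    if node == "OUTPUT":
        return 1.0
    return max((w * influence(dst) for dst, w in edges.get(node, [])),
               default=0.0)

pruned = {n for n in edges if influence(n) < 0.01}
print("pruned:", pruned)  # {'unrelated_sport'}
```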

Keep in mind that this is a very rough overview of Anthropic’s AI microscope. For full details, refer to Anthropic’s original papers.

Using this approach, Anthropic researchers have arrived at a number of interesting results. Regarding multilingual capabilities, they found evidence of a kind of universal language that Claude uses to generate concepts before translating them into a specific language.

We investigate this by asking Claude for the “opposite of small” across different languages, and find that the same core features for the concepts of smallness and oppositeness activate, and trigger a concept of largeness, which gets translated out into the language of the question.
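In code terms, the finding amounts to the intersection of active feature sets being non-empty across languages. A purely hypothetical illustration, with made-up feature ids:

```python
# Hypothetical active-feature ids when asking "the opposite of small"
# in three languages; the ids and groupings are invented for illustration.
active_features = {
    "en": {17, 42, 88, 301},   # smallness, antonym, largeness, say-in-English
    "fr": {17, 42, 88, 305},   # same core concepts, different output language
    "zh": {17, 42, 88, 309},
}

core = set.intersection(*active_features.values())
print("shared concept features:", core)  # {17, 42, 88}: language-independent
```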

Another interesting finding goes against the common understanding that LLMs build their output word by word, “without much forethought”. Instead, studying how Claude generates rhymes shows that it actually plans ahead.

Before starting the second line, it began “thinking” of potential on-topic words that would rhyme with “grab it”. Then, with these plans in mind, it writes a line to end with the planned word.

Anthropic researchers also dug into why models sometimes make up information, i.e., hallucinate. Hallucination is in some sense intrinsic to how models work, since they are always supposed to produce a next guess, which implies that models must rely on specific anti-hallucination training to counter that tendency. In other words, there are two distinct mechanisms at play: one identifying “known entities” and another corresponding to “unknown name” or “can’t answer”. Their correct interplay is what guards models against hallucinating:

We show that such misfires can occur when Claude recognizes a name but doesn’t know anything else about that person. In cases like this, the “known entity” feature might still activate, and then suppress the default “don’t know” feature—in this case incorrectly. Once the model has decided that it needs to answer the question, it proceeds to confabulate: to generate a plausible—but unfortunately untrue—response.
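The described interplay can be caricatured as a default refusal signal that entity recognition suppresses; a toy sketch, not Anthropic’s actual circuitry:

```python
def answer_gate(known_entity, knows_facts):
    """Toy model of the interplay Anthropic describes: a default "don't know"
    circuit that a 'known entity' feature can suppress. Illustrative only."""
    refusal = 1.0                      # the refusal feature is on by default
    if known_entity:
        refusal -= 1.2                 # recognizing the name suppresses refusal
    if refusal > 0:
        return "I don't know."
    return "states a fact" if knows_facts else "confabulates a plausible answer"

print(answer_gate(known_entity=False, knows_facts=False))  # I don't know.
print(answer_gate(known_entity=True,  knows_facts=True))   # states a fact
print(answer_gate(known_entity=True,  knows_facts=False))  # misfire: confabulates
```

The third case is the misfire: recognition alone switches off the refusal, even though the model has no facts to back up an answer.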

Other interesting dimensions explored by Anthropic researchers concern mental math, producing a chain-of-thought that explains the reasoning behind an answer, multi-step reasoning, and jailbreaks. You can find full details in Anthropic’s papers.

Anthropic’s AI microscope aims to contribute to interpretability research and, eventually, to provide a tool that helps us understand how models produce their inferences and to ensure they are aligned with human values. Yet it is still an incipient effort: it captures only a tiny fraction of the total model computation and can only be applied to small prompts of a few tens of words. InfoQ will continue to report on advancements in LLM interpretability as new insights emerge.
