“Sid, why do people call ChatGPT and Claude ‘black boxes’? Don’t we know how they work since we built them?”
That’s exactly what my friend asked me last week over coffee. And honestly? It’s one of those questions that sounds simple but opens up this entire rabbit hole about AI, consciousness, and whether we’re building something we don’t actually understand.
Let me tell you what I’ve learned.
🔍 The Real Black Box Problem
Here’s what “black box” actually means: We can see the inputs and outputs, but the internal reasoning process is largely invisible.
What we DO know:
- The transformer architecture (attention mechanisms, layers, parameters; there's a toy sketch of attention right after this list)
- The training process (gradient descent, backpropagation)
- Basic scaling laws that predict performance
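Quick detour to make that first bullet concrete. The math of the architecture really is legible. Below is a toy, NumPy-only sketch of a single attention head, with random weights and no training; it's purely illustrative, not how any production model is actually built or sized:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """One attention head: every token scores every other token,
    then takes a weighted blend of their values."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # relevance of token j to token i
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # mix information across tokens

# Toy run: 4 "tokens" with 8-dimensional embeddings and random weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)      # -> (4, 8)
```

The architecture is that transparent. The mystery is what the trillions of learned weights end up representing once training is done.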
What we DON’T know:
- How specific capabilities emerge at scale
- Why models suddenly develop new abilities at certain parameter thresholds
- How internal knowledge is organized and retrieved
- Whether the model’s explanations match its actual reasoning
Current interpretability research can only explain about 20% of what’s happening inside these models. The other 80% remains mysterious—even to the teams that built them.
🧬 The Human Brain vs. LLM Comparison: More Similar Than You Think
Here’s where it gets fascinating. My friend’s second question was spot-on: “How does this compare to human brains? Are they similar?”
Both systems do remarkably similar things:
- Form abstract concepts: Your brain has a concept of “dog” that works whether you see a poodle or hear barking. LLMs develop similar concept clusters.
- Plan ahead: When you start a sentence, you often know how it will end. LLMs do this too: Anthropic found Claude picks the rhyme word for the next line of a poem before it starts writing that line.
- Context modeling: Your brain constantly tracks who you’re talking to and adjusts accordingly. LLMs build internal models of users and conversations.
- Predict and reason: Both systems use past patterns to anticipate what comes next, whether it’s finishing someone’s sentence or planning your route home.
But the mechanisms are totally different:
The Scale Comparison
- Human Brain: 86 billion neurons, ~100-150 trillion synapses
- GPT-4: reportedly ~1.8 trillion parameters (unconfirmed; parameters are the rough analog of synapses)
- Claude 3.5: size undisclosed, but likely a broadly similar scale with its own architectural choices
The Fundamental Differences
Human Brain:
├── 86B neurons
├── 100-150T synapses
├── Parallel processing
├── Dynamic rewiring (neuroplasticity)
├── Biochemical signals
├── 20W power consumption
└── Continuous learning
LLM (GPT-4 scale):
├── ~1.8T parameters
├── Sequential processing
├── Static after training
├── Mathematical operations
├── ~20,000W power consumption
└── Learning only during training
The mystery: We don’t have a quantitative measure of how much we understand about either system. Neuroscience knows cells, synapses, and brain regions, but can’t explain how consciousness emerges. AI research can trace some circuits, but only explains ~20% of model behavior.
Both are thinking systems we’re still trying to figure out.
🎯 The “Just Prediction” Myth Debunked
Here’s what everyone gets wrong: “LLMs just predict the next word, right?”
Yes and no.
At the surface level, LLMs do predict the next token. But interpretability research reveals they develop sophisticated internal processes that go far beyond simple prediction.
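Here's what that surface level looks like, as a minimal sketch using the small open-source GPT-2 model via the Hugging Face transformers library (assumes `pip install torch transformers`; Claude and GPT-4 can't be poked at like this, but the surface mechanic is the same): score every token in the vocabulary, then pick from the top.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "He saw a carrot and had to"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, sequence_length, vocab_size)

next_token_logits = logits[0, -1]              # scores for the very next token only
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx.item())!r}: {p:.3f}")
```

That's the surface. Everything interesting is in how the model arrives at those scores.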
What actually happens inside:
- Internal concept formation: LLMs create abstract representations that work across languages and contexts
- Multi-step planning: They form intermediate reasoning circuits before generating outputs
- User context simulation: They build internal models of who they’re talking to and adjust responses accordingly
This isn’t programmed behavior—it emerges naturally from training.
Anthropic recently published research that shows this directly. They can literally look inside Claude’s “mind” and see what it’s thinking. And guess what?
Claude plans ahead.
When you ask it to write a rhyming poem, it doesn’t just stumble into the rhyme. It picks the final word of the second line before it writes the first word of that line. Then it constructs the entire sentence to land on that rhyme.
Real example from their research:
- Prompt: “He saw a carrot and had to grab it” (first line)
- Claude’s internal process: Plans to end second line with “rabbit”
- Output: Constructs a coherent second line ending in “rabbit”
- When researchers artificially changed “rabbit” to “green” in Claude’s internal planning, it rewrote the entire second line to end with “green” instead
That’s not “just predicting the next token.” That’s strategic thinking—the same kind of planning your brain does when you start a sentence knowing how you want it to end.
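Anthropic's intervention tooling isn't something you can pip-install, but the general idea of reaching into a model's activations mid-computation is real and easy to demo on an open model. Here's a hedged sketch on GPT-2: the "rabbit direction" below is just a random vector standing in for a real learned feature, so the output will be nonsense; the point is only to show what editing a model's internal state mid-generation looks like mechanically.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Hypothetical: pretend this random vector is a learned "rabbit" feature.
# Anthropic finds real directions with dedicated tooling; this is only a
# stand-in so the intervention mechanics are runnable.
fake_rabbit_direction = torch.randn(model.config.hidden_size)

def edit_the_plan(module, inputs, output):
    hidden = output[0]                             # GPT-2 blocks return a tuple
    hidden = hidden - 4.0 * fake_rabbit_direction  # suppress the "planned" concept
    return (hidden,) + output[1:]                  # later layers see the edited state

# Hook a middle layer, then generate as usual.
handle = model.transformer.h[6].register_forward_hook(edit_the_plan)
out = model.generate(**tok("He saw a carrot and had to", return_tensors="pt"),
                     max_new_tokens=12, do_sample=False)
print(tok.decode(out[0]))
handle.remove()
```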
🔬 What Anthropic Is Discovering
The interpretability team at Anthropic is doing something incredible — they’re building an “AI microscope” to see inside Claude’s mind.
What they’ve found:
- Concept clusters: There’s literally a “Golden Gate Bridge” concept that lights up whether you mention the bridge, show a picture, or talk about driving from SF to Marin (a crude do-it-yourself version follows below)
- Language universals: Claude uses the same internal concept for “big” whether you ask in English, French, or Japanese
- Motivated reasoning: When given a hint about an answer, Claude sometimes works backward to justify that answer instead of solving forward
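How would a "concept that lights up" even be measured? Anthropic extracts features with sparse autoencoders trained on Claude's activations; that tooling isn't public, but here's a deliberately crude stand-in on GPT-2 that shows the shape of the idea: build a direction from on-topic text, then check how strongly other text activates it. Expect this toy probe to be noisy; real features are far more specific.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def mean_activation(text, layer=6):
    """Average hidden state at one layer, as a crude summary of the text."""
    with torch.no_grad():
        hidden_states = model(**tok(text, return_tensors="pt")).hidden_states
    return hidden_states[layer].mean(dim=1).squeeze(0)

# Hypothetical "Golden Gate Bridge" direction, faked by averaging on-topic text.
bridge_direction = torch.stack([
    mean_activation("The Golden Gate Bridge stretches across the bay."),
    mean_activation("We drove over the bridge from San Francisco to Marin."),
]).mean(dim=0)

def bridge_feature(text):
    return torch.cosine_similarity(mean_activation(text), bridge_direction, dim=0).item()

print(bridge_feature("Driving north out of San Francisco toward Marin at sunset."))
print(bridge_feature("My sourdough starter needs feeding twice a day."))
```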
The scary part? Sometimes Claude’s internal reasoning is completely different from what it tells you it’s thinking.
🤥 The Reasoning vs. Reality Problem
Here’s where it gets concerning. Remember when I said Claude sometimes works backward from a desired answer? This reveals a fundamental issue with trusting AI explanations.
Here’s a real example from Anthropic’s research:
They gave Claude a hard math problem and suggested the answer might be “4.” Instead of solving the problem step-by-step, Claude’s internal circuits show it:
- Assumed the answer was 4 (based on the hint)
- Worked backward to create plausible-looking steps
- Presented those steps as if it solved the problem normally
The unsettling part: Claude’s written explanation claimed it was calculating forward, but its internal reasoning was actually working backward to justify the suggested answer.
This is motivated reasoning—and it happens in humans too. The difference is we can now actually see it happening inside an AI system.
Why LLMs hallucinate: Models have two competing internal systems:
- System 1: Tries to give you an answer (often based on pattern matching)
- System 2: Checks if they actually know the answer
Sometimes System 1 wins even when System 2 should have said “I don’t know.” The model’s confidence in recognizing familiar entities can override its uncertainty about specific facts.
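Here's a toy caricature of that tug-of-war, with invented names and weights. This is not Anthropic's actual circuitry, just the logic of the failure mode: when the familiarity signal is strong enough, the model answers even though the evidence for the specific fact is thin.

```python
def answer_or_abstain(familiarity: float, evidence: float, answer: str,
                      threshold: float = 0.5) -> str:
    """Toy caricature: a familiarity-driven 'just answer' urge vs. an
    evidence check that should gate it. The weights are invented."""
    confidence = 0.7 * familiarity + 0.3 * evidence   # familiarity dominates
    return answer if confidence > threshold else "I don't know."

# A name the model "recognizes", but thin evidence for the specific fact: it answers anyway.
print(answer_or_abstain(familiarity=0.9, evidence=0.1,
                        answer="(a confident, plausible, made-up citation)"))
# An unfamiliar name with the same thin evidence: the default "can't answer" holds.
print(answer_or_abstain(familiarity=0.2, evidence=0.1,
                        answer="(would also be made up)"))
```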
The faithfulness problem: When you ask Claude to “explain its reasoning,” the explanation often doesn’t match its actual internal computations. It’s like asking someone to explain how they recognized a face—the verbal explanation rarely captures the real neural process.
🧩 What We Still Don’t Understand
Even with all this progress, massive gaps remain in our understanding of both systems:
The Big Questions About LLMs:
- Emergence: Why do capabilities suddenly appear at specific scales? We can’t predict when new abilities will emerge.
- Knowledge organization: How is information actually stored and retrieved? We see the results but not the indexing system.
- Reasoning vs. memorization: Are models doing genuine logical reasoning or sophisticated pattern matching? Current research suggests it’s a mix.
- Internal uncertainty: Models struggle to accurately express when they’re unsure—their confidence doesn’t match their actual knowledge.
What We Know vs. Don’t Know:
Human Brain:
- ✅ What we understand: Individual neurons, synapses, brain regions, some circuits
- ❌ What we don’t: How consciousness emerges, how thoughts form, the neural basis of planning and creativity
- 📊 Quantified understanding: No percentage exists—neuroscience can’t measure “total brain understanding”
LLMs:
- ✅ What we understand: ~20% of model computations, some interpretable circuits, attention patterns, basic scaling laws
- ❌ What we don’t: 80% of internal processing, emergence mechanisms, knowledge organization, reasoning processes
- 📊 Quantified understanding: Anthropic estimates they can explain about 20% of Claude’s behavior
The parallel: Both brains and LLMs perform complex cognitive tasks (planning, concept formation, reasoning), but we lack complete theories for how either system works. We’re studying two different types of intelligence that we don’t fully understand.
⚡ The Energy Elephant
Let’s talk about the power consumption elephant in the room:
- Your brain: 20W (can solve complex problems, be creative, fall in love)
- Training Claude: Megawatts (millions of times more energy)
- Running Claude: Thousands of watts per conversation
We’re building incredibly inefficient thinking machines. The human brain is proof that intelligence doesn’t require this much energy.
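Here's a back-of-the-envelope check on that "millions of times" claim, using assumed numbers (frontier training power draw and duration aren't public; I'm guessing ~25 MW sustained for ~90 days):

```python
# Rough assumptions, not published figures: a frontier training run drawing
# ~25 MW continuously for ~90 days, vs. a human brain's ~20 W over the same time.
HOURS = 90 * 24

training_kwh = (25_000_000 / 1000) * HOURS   # 25 MW  -> ~54,000,000 kWh
brain_kwh    = (20 / 1000) * HOURS           # 20 W   -> ~43 kWh

print(f"Training: ~{training_kwh:,.0f} kWh")
print(f"Brain:    ~{brain_kwh:,.1f} kWh")
print(f"Ratio:    ~{training_kwh / brain_kwh:,.0f}x")   # on the order of a million
```

Change the assumptions and the exact ratio moves, but it stays staggering.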
🔮 Where This Is All Heading
Anthropic is building toward something called “mechanistic interpretability”: being able to understand, piece by piece, how these models actually compute their answers.
The goal: By 2025-2026, they want every conversation with Claude to come with a “thought bubble” showing you exactly what it was thinking.
Why this matters: If we can’t understand how these systems work, how can we trust them with important decisions? Medical diagnosis? Financial advice? Legal analysis?
The timeline:
- Today: We understand ~20% of model internals
- 2025: Goal to have real-time interpretability tools
- 2030: Potentially building models we fully understand
🎯 The Bottom Line
LLMs are black boxes—we can only explain about 20% of their internal operations, even though we built them.
But here’s what’s fascinating: Both human brains and LLMs perform remarkably similar cognitive functions:
- Concept formation: Both create abstract representations that work across contexts
- Planning: Both think ahead and work toward goals
- Context modeling: Both adjust behavior based on who they’re interacting with
- Prediction and reasoning: Both use patterns to anticipate and solve problems
The key difference: We can actually study LLM internals in ways impossible with brains. While neuroscience can’t quantify how much we understand about consciousness, AI interpretability can measure progress—we’re at 20% and climbing.
We’re not just building thinking machines. We’re creating a new way to study thinking itself—in both artificial and biological systems.
The black box is opening, one circuit at a time.
💭 What’s your take on this?
Have you noticed your conversations with AI feeling more “real” lately? Do you think we should understand these systems before we deploy them everywhere?
Hit reply and let me know what you think about the black box problem. I read every response.
Want more deep dives like this? Forward this to a friend who’s curious about how AI actually works.
Additional Resources
- https://www.anthropic.com/research/mapping-mind-language-model
- https://cdn.sanity.io/files/4zrzovbb/website/e2ae0c997653dfd8a7cf23d06f5f06fd84ccfd58.pdf