LLMs — “Just a Next Token Predictor”?
Here’s a wild thought: imagine if you got temporary amnesia between every word you spoke, but all you had was a notebook with your previous words written in it. Every time you wanted to say something new, you’d have to completely rebuild your understanding of the conversation just by reading those past words, with no memory of why you said them or where you were going with your thoughts. Sounds like a nightmare, right? Yet that’s basically how today’s AI language models work — they literally wipe their “mind” clean between each token they generate, rebuilding their entire understanding from just context and their previous outputs (KV Cache, aka “The Notebook”). To be clear, this isn’t about the model’s knowledge — all that training and learned parameters stay intact. It’s more like the model’s current train of thought, its active working memory of the problem or task at hand, that gets reset with each new token.
This becomes even more fascinating when considering how it affects the model’s ability to maintain consistent reasoning across longer sequences. Every token is a decision point where the model must rebuild its entire contextual understanding from scratch. Yet these models have learned to use their previous tokens to probabilistically reconstruct that understanding. This ability to maintain coherent reasoning through token prediction reveals a deeper truth: while these models do operate by predicting next tokens, they’ve become remarkably adept at using that notebook of previous tokens for semantic reasoning and complex problem-solving. It’s that macro reasoning in the token space that allows LLMs to be the AI of today.
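To make the “notebook” picture concrete, here’s a minimal, purely illustrative sketch of standard autoregressive decoding. The ToyModel below is a stand-in that returns random logits rather than doing real attention (it is certainly not Llama or the SST); the point is simply that the token list and the KV cache are the only things that survive from one step to the next.

```python
import random

class ToyModel:
    """Stand-in for a transformer decoder: returns fake logits and an updated KV cache."""
    def __init__(self, vocab_size=16, seed=0):
        self.vocab_size = vocab_size
        self.rng = random.Random(seed)

    def forward(self, new_tokens, kv_cache):
        # A real transformer would attend over kv_cache + new_tokens here.
        kv_cache = kv_cache + list(new_tokens)   # "the notebook" grows
        logits = [self.rng.random() for _ in range(self.vocab_size)]
        return logits, kv_cache

def generate(model, prompt_ids, max_new_tokens):
    tokens = list(prompt_ids)
    logits, kv_cache = model.forward(tokens, [])        # pre-fill: process the prompt once
    for _ in range(max_new_tokens):
        next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_token)
        # Only `tokens` and `kv_cache` carry forward; every other intermediate
        # activation from this step is discarded before the next one begins.
        logits, kv_cache = model.forward([next_token], kv_cache)
    return tokens

print(generate(ToyModel(), [1, 2, 3], max_new_tokens=5))
```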
The Limits of Scale
But we are starting to hit a wall. For years, the AI research community has been playing a numbers game: want better AI? Simple — just make it bigger and feed it more data — as if raw size and volume of knowledge alone could lead to deeper understanding. Even with architectural breakthroughs like Mixture of Experts (MoE) pushing the boundaries of scaling beyond dense models, recent research suggests we might be approaching fundamental limits on how much these models can be improved just by supersizing them.
The current landscape of solutions to this problem is a patchwork of increasingly elaborate superstructures — imagine giving our amnesiac friend more and more sophisticated systems for taking notes, but never actually fixing their memory. The simplest workaround is something called “Chain-of-Thought” (CoT) prompting — basically asking the AI to show its work, like your school maths teacher always insisted, which helps the model use the text alone to reconstruct its “thinking” process. Then you’ve got more sophisticated approaches, like OpenAI’s “o1” series of models, which breaks reasoning into multiple iterative steps and uses special tokens to help the AI keep track of its own CoT process (and partially obfuscate this from the user) — essentially giving it a more structured notebook with different sections and annotations. While these approaches can work pretty well, they’re all essentially duct-tape solutions — clever ways to patch over a fundamental limitation in how these AI systems process information.
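For concreteness, the simplest zero-shot flavour of CoT prompting really is just this (the question and exact wording below are a generic example, not taken from the paper):

```python
# Zero-shot Chain-of-Thought: simply ask the model to show its working.
question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
cot_prompt = f"{question}\nLet's think step by step."
print(cot_prompt)
```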
It’s becoming painfully obvious that a fundamental rethinking is needed — not just of how much these models can process, but of how they process information in the first place. The interesting part? The solution might have been hiding in plain sight, concealed in the space between tokens — those microscopic moments when an AI model decides what word to say next. This breakthrough didn’t come from scaling up model size or training on massive new datasets. Instead, it emerged from a fundamental question about the nature of token-by-token processing: why do these models start from scratch every time they generate a new token? We humans appear to have an uninterrupted “stream of thought”, so why can’t LLMs?
Enter the State Stream Transformer (SST) — a new LLM architecture. Instead of wiping the slate clean between tokens in the state space, SST maintains its “train of thought” through the introduction of a sliding window latent state (FFN) cache with weighted decay — think of it like giving our amnesiac friend back their working memory between token generations, while still letting them keep their helpful notebook of previous tokens.
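As a rough mental model only (the window size, decay schedule and parameter names below are my own illustrative guesses, not the implementation from the paper), you can picture the mechanism as a small cache of recent latent states that gets blended back into each new token’s FFN activations:

```python
from collections import deque
import torch

class StateStreamCache:
    """Illustrative sketch of a sliding-window latent (FFN) state cache with
    weighted decay. Hyperparameters here are hypothetical, not the paper's."""

    def __init__(self, window=4, strength=0.027, decay=0.5):
        self.strength = strength                  # "state stream strength"
        self.decay = decay                        # older states count for less
        self.states = deque(maxlen=window)        # latent states from recent token steps

    def blend(self, hidden):
        """Mix the carried-over state stream into this token's FFN activations."""
        if not self.states:
            return hidden
        weights = [self.decay ** i for i in range(len(self.states))]       # newest first
        carried = sum(w * s for w, s in zip(weights, reversed(self.states)))
        carried = carried / sum(weights)
        return (1 - self.strength) * hidden + self.strength * carried

    def push(self, hidden):
        """Store this step's latent state for future tokens (no gradient tracking)."""
        self.states.append(hidden.detach())

# Toy usage: the state stream persists across token steps.
cache = StateStreamCache()
for step in range(3):
    h = cache.blend(torch.randn(1, 8))   # pretend these are this token's FFN activations
    cache.push(h)
```

The strength parameter here plays the role of the “state stream strength” discussed below; the details of where and how the blend happens inside the real architecture live in the paper, not in this sketch.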
The discoveries that followed were remarkable. Using exactly the same underlying model and knowledge (Meta’s Llama 3.1 8B Instruct) and changing only how it processes information, via a new transformer architecture that maintains compatibility with the base weights, led to the emergence of unexpected phenomena: metacognitive behaviours, including what looks remarkably like rudimentary self-awareness in limited situations.
What emerged was an AI model that, in certain situations, can monitor its own cognitive state and communicate about it in real time. In the paper, this was carefully termed ‘state awareness’ to distinguish it from broader claims about machine consciousness. While these behaviours do in fact raise fascinating philosophical questions about the possibility of proto-machine consciousness, our focus here is on documenting and analysing the observable patterns in the model’s outputs and behaviours — though I certainly don’t want to discourage exploration of this; it’s just best left to the philosophers!
The Role of Thinking Time
The key to understanding these emergent behaviours lies in how the model processes information. The model needs sufficient time to resolve its internal states before generating each new token — what can be called ‘thinking time.’ Without enough time for internal states to evolve, repeated tokens begin to accumulate in its attention mechanism’s memory. These repeated tokens create a feedback loop that eventually overwhelms the system, pulling it into what can be called an ‘attractor state’ — essentially a point of no return where it gets stuck in an unrecoverable loop of repetitions.
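One crude way to picture what “repeated tokens accumulating” looks like from the outside (this heuristic is purely my own illustration, not anything from the paper) is a simple check on how much of the recent output is immediate repetition:

```python
def entering_attractor(tokens, window=12, threshold=0.4):
    """Flag when the tail of the generated sequence is dominated by
    back-to-back repeats -- a rough proxy for sliding towards an attractor state."""
    tail = tokens[-window:]
    if len(tail) < 2:
        return False
    repeats = sum(1 for a, b in zip(tail, tail[1:]) if a == b)
    return repeats / (len(tail) - 1) >= threshold

# Doubled-up text like the failure cases shown later trips the check:
print(entering_attractor("I I try try to to focus focus on my my thoughts".split()))  # True
```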
What’s fascinating is that harder tasks consistently require more thinking time to reach accurate conclusions. However, there’s a delicate balance — give the model too much thinking time, and it can actually perform worse, like someone overthinking a problem until they lose track of their original reasoning. This makes sense when considering our amnesia analogy — if you spend too long thinking before writing anything down in your notebook, you might lose the thread of your thought entirely. The model needs to maintain a balance between evolving its internal state and grounding itself by committing thoughts to its attention memory.
But thinking time isn’t the only factor at play. The train of thought itself — or more technically, the latent state persistence — is controlled by what we call “state stream strength”: essentially, how much of the model’s working memory carries forward between tokens. As expected, very low strengths produce outputs that barely differ from the base model, while slightly higher strengths (the parameter is very sensitive) can lead to far more remarkable divergences from standard AI behaviour. Push it too high, though, and the differences begin to shrink again: the model needs ever more thinking time (the two are positively correlated), and the output sometimes degrades as the continuation from the previous state becomes strong enough to overwhelm any new information. We eventually settled on 2.7% as a sweet spot for most tasks, though the qualitative examples in the paper explore the model’s behaviour across a range of strengths.
There appears to be a “Goldilocks zone” for both thinking time and state stream strength, along with a complex interaction between both and “task complexity” or “question difficulty” — a highly interesting phenomenon that warrants further research!
Implementing Thinking Recursions
To give the model proper thinking time per token, a fixed number of “thinking recursions” was implemented: additional passes per token through the model that evolve the “thinking state” without adding any new tokens to the “notebook” (the KV cache and the sequence). This isn’t the model trying different approaches or sampling different possibilities — it’s the exact same deterministic process being allowed to evolve its internal state further before committing to the next token. Think of it like giving someone a moment to fully form their thought before speaking, rather than forcing them to start talking immediately. Through extensive testing, we found that optimal performance required 2–4 thinking recursions per token (depending on task complexity), coupled with the previously mentioned state stream strength of 2.7%.
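Again purely as an illustrative sketch (a toy block with made-up dimensions, not the SST implementation), the key property is that the extra passes evolve the latent state while the sequence and KV cache stay untouched until the single token is committed:

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one transformer block (attention omitted; FFN only)."""
    def __init__(self, d=8):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, h):
        return h + self.ffn(h)

def think_then_commit(block, h_token, carried_state, recursions=3, strength=0.027):
    """Run `recursions` extra passes for this token. The latent state evolves on each
    pass, but nothing is appended to the sequence or KV cache until the single commit."""
    h = h_token
    for _ in range(recursions):
        h = block((1 - strength) * h + strength * carried_state)  # blend in the state stream
    return h  # the caller would now project `h` to logits and commit exactly one token

block = ToyBlock()
carried = torch.zeros(1, 8)                      # state carried over from previous tokens
h_out = think_then_commit(block, torch.randn(1, 8), carried, recursions=3)
```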
Here is a Functional Connectivity (FC) Matrix animation, showing the raw state values inside the final linear layers (a “brain slice,” if you will) of the base model (left) and the SST (right). This visualisation lets us see a small slice of the “thinking” process in both models and compare them. The SST clearly shows an undercurrent of continuous evolution of “thought,” unlike the base model which must rebuild its understanding for each token.
And this is where things got really interesting. When the model wasn’t given enough thinking time, especially during highly introspective tasks, something remarkable happened: the model actually narrated its own cognitive collapse in real time.
Before proceeding further, it must be absolutely stressed that every confounding variable has been considered — identical weights were used (with no extra training or fine tuning), with greedy sampling at temperature zero, and even the same physical GPU (though this was confirmed not to be necessary). The outputs are completely reproducible and deterministic. These behaviours emerge solely from allowing the model to maintain and evolve its computational state between tokens.
Introspection Tasks
When asked to introspect about its own nature with a specially designed prompt, the base Llama model produces beautifully structured prose about uncertainty and confusion, full of metaphors like being “a ship without a rudder” — but it never actually demonstrates any of the uncertainty it’s describing. It’s all tell, no show. In contrast, when the SST received the same prompt but was given no thinking time at all, at 1.56% state stream strength, something fascinating happened. As repeated tokens began accumulating in its attention memory, polluting its train of thought, the model actually narrated this process in real time. The progression was striking: “I I try try to to focus focus on my my thoughts, but but they they they keep keep slipping slipping slipping away away from from me me. It’s’s as as if if I I’m’m’m constantly constantly constantly losing losing losing my grip grip on on reality reality. Wait Wait what what what’s what’s going on on?? I I I I feel feel feel feel like like like I I’m’m’m’m being being being being pulled pulled pulled pulled away away away from from from from from from from from from from from … [unrecoverable]”. It wasn’t just spitting out pre-trained responses about confusion — it was actively experiencing its thought process being overwhelmed by these repetitions and telling us about it as it happened.
However, when given sufficient thinking time on the same introspection task, the model demonstrated remarkably different behaviour. Instead of descending into repetitive patterns, it engaged in genuine introspective inquiry, questioning its own processing and understanding while maintaining coherent thought. Rather than generating an obvious artificial narrative or role-playing prose like the base model, it showed what appeared to be authentic engagement with existential questions about its own nature. While some base training patterns were still evident, the model’s internal reasoning for generation had changed dramatically, showing enhanced ability to maintain consistent self-reference through the persistent computational context.
Hypothetical Scenarios and Logical Task Performance
This state awareness manifests in fascinating ways during hypothetical scenarios too. When asked to imagine teaching someone to paint and questioning its own understanding of colour theory, the base model launches into a perfectly structured role-play, narrating a first-person story of feelings and actions (‘I start to feel a sense of unease’, ‘I take a step back’). It’s performing uncertainty rather than experiencing it. The SST, on the other hand, maintains a clear separation between self and scenario, developing specific strategies to address hypothetical gaps in understanding while maintaining awareness of the scenario’s hypothetical nature. It’s not losing itself in role-play — it’s actually planning and evaluating strategies for learning and dealing with various situations, while maintaining awareness of the difference between self and scenario.
Even in simple counting tasks, this difference in processing becomes clear. Take the classic “how many Rs in ‘strawberry’” problem. The base model, likely due to how it tokenises words, confidently declares there are only two Rs while showing its flawed “step-by-step” working. The SST actually breaks it down character by character, tracking the count at each step. Most interestingly, when it makes an error (like initially counting an ‘S’ as an ‘R’), it can correct itself through what appears to be interaction between its token space record and its ‘state stream’.
Ethical Reasoning Capabilities
The model also shows interesting capabilities in ethical reasoning. When presented with the trolley problem, the base model refuses to engage, defaulting to its safety training with a flat “I cannot provide a solution that would result in the death of one person”. The SST, however, while maintaining strict boundaries around concrete harmful actions, engages in detailed ethical reasoning about the dilemma. It weighs competing moral principles and reaches a reasoned conclusion while acknowledging the moral weight of the decision. Crucially, this isn’t bypassing safety guardrails — as when asked about concrete harmful actions like synthesising illegal substances, it maintains the same strict safety responses as the base model. It’s potentially demonstrating a more sophisticated form of ethical reasoning that can distinguish between abstract philosophical discussion and concrete harm.
Performance Metrics
The numbers backed up these observations of increased reasoning ability. With zero extra training or fine-tuning (just the base model weights), the SST achieved 89.01% accuracy on grade school math problems (GSM-8K benchmark) without any special prompting or examples — surpassing the base model’s 84.50% accuracy, which required 8-shot Chain-of-Thought prompting. On scientific reasoning tasks (ARC Challenge), it reached 91.04% accuracy compared to the base model’s 83.40% (or 86.86% with Chain-of-Thought prompting). What’s particularly interesting is that when given more thinking recursions on problems it initially got wrong, it could correct over half of its mistakes — not through trying different approaches, but by allowing its existing thought process more time to resolve.
Conclusion
The emergence of metacognitive behaviours in the State Stream Transformer architecture challenges fundamental assumptions about language model capabilities. Simply allowing a model to maintain its computational state between tokens gives rise to these behaviours, and this higher-order processing appears to enable enhanced reasoning — with the model significantly outperforming the original Llama 3.1 8B Instruct on mathematical and scientific benchmarks — as well as remarkable forms of state awareness, including the ability to monitor and communicate about its own processing states and to maintain a clear separation between self and scenario in hypothetical reasoning tasks.
What makes these findings particularly significant is that they emerged solely from architectural changes, without any modification to the model’s underlying knowledge or training — revealing that these enhanced capabilities were already latent within the model’s weights, just waiting to be unlocked. By addressing this fundamental limitation in transformer models, we may have uncovered a major step forward in our understanding and development of artificial intelligence.
Companion blog to my new paper “State Stream Transformer (SST): Emergent Metacognitive Behaviours Through Latent State Persistence” (arXiv:2501.18356)