Introduction
Today, Inception Labs released the first commercially available diffusion large language model (dLLM), Mercury Coder, causing a big stir in both the research community and the AI industry. In contrast to auto-regressive LLMs (essentially all the LLMs you know today), a diffusion LLM works like your favorite AI image generators, such as Stable Diffusion: the final result emerges from a cloud of gibberish text. See the example below, a visualization of asking Mercury Coder to write a Python program to split an image into halves:
Key Points
- Research suggests diffusion LLMs are a new type of language model using diffusion techniques, potentially faster and more efficient than auto-regressive models.
- Inception Labs launched Mercury Coder, a commercial-scale diffusion LLM, claiming speeds over 1000 tokens/second, 5-10x faster than competitors.
- It seems likely that diffusion LLMs could challenge auto-regressive tech, offering new capabilities like improved reasoning and controllability, but their full impact is still emerging.
- Andrej Karpathy and Andrew Ng, both renowned AI researchers, have enthusiastically welcomed the arrival of Inception Labs' diffusion LLM.
Understanding Diffusion LLMs
Diffusion LLMs represent a novel approach to language modeling, leveraging diffusion techniques traditionally used in generative models for continuous data like images and video. The concept is rooted in the idea of starting with a noisy version of the data and iteratively denoising it to produce the desired output. For text, this involves a forward process of masking tokens and a reverse process of predicting these masked tokens, optimized to maximize a likelihood bound.
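The forward masking and reverse prediction steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `MASK` string, the toy token list, and the helper names are hypothetical, and real models operate on token ids with a Transformer doing the prediction.

```python
import random

MASK = "[MASK]"  # hypothetical mask token; real models use a special token id

def forward_mask(tokens, t):
    """Forward (noising) process: independently replace each token with
    MASK with probability t, where t ranges from 0 (clean) to 1 (all noise)."""
    return [MASK if random.random() < t else tok for tok in tokens]

def masked_positions(corrupted):
    """The reverse process trains a model to predict the tokens at exactly
    these positions; cross-entropy on them (reweighted by 1/t) yields an
    upper bound on the negative log-likelihood."""
    return [i for i, tok in enumerate(corrupted) if tok == MASK]

tokens = ["def", "split", "(", "img", ")", ":"]
noisy = forward_mask(tokens, t=0.5)   # roughly half the tokens masked
targets = masked_positions(noisy)     # positions the model must recover
```

At `t = 1` every token is masked, which is the "pure noise" starting point the sampler denoises from at generation time.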
A significant recent development is the paper “Large Language Diffusion Models” by Shen Nie and others, published on February 14, 2025, introducing LLaDA. This model is trained from scratch under a pre-training and supervised fine-tuning (SFT) paradigm, using a vanilla Transformer to predict masked tokens.
LLaDA demonstrates strong scalability, outperforming auto-regressive model (ARM) baselines and proving competitive with LLaMA3 8B in in-context learning and instruction-following abilities, such as multi-turn dialogue. Notably, it addresses the reversal curse, surpassing GPT-4o on a reversal poem completion task (Large Language Diffusion Models).
This approach contrasts with auto-regressive LLMs, which generate text token by token, each dependent on previous tokens, leading to sequential processing that can be slow and computationally expensive for long sequences. Diffusion LLMs, by enabling parallel token generation, offer potential advantages in speed and efficiency, which could revolutionize language generation tasks.
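To make the contrast concrete, here is a toy sketch of parallel, coarse-to-fine decoding: every masked position is predicted at once, the most confident predictions are committed, and the remaining slots stay masked to be re-predicted in the next step. Everything here (`toy_model`, its fixed vocabulary, the confidence scores) is a hypothetical stand-in; the actual samplers in Mercury Coder or LLaDA are far more sophisticated.

```python
MASK = None  # mask placeholder for this sketch

def toy_model(seq):
    """Stand-in predictor: returns {position: (token, confidence)} for every
    masked slot, scored in parallel. A real model would run a Transformer."""
    vocab = ["hello", "diffusion", "world"]
    return {i: (vocab[i % len(vocab)], 0.5 + 0.1 * i)
            for i, tok in enumerate(seq) if tok is MASK}

def denoise(length, steps=3):
    seq = [MASK] * length                 # start from all-mask "noise"
    for _ in range(steps):
        preds = toy_model(seq)            # predict all masked slots at once
        if not preds:
            break
        k = max(1, len(preds) // 2)       # commit only the top-k predictions
        best = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)[:k]
        for i, (tok, _conf) in best:      # coarse-to-fine: confident first
            seq[i] = tok
    for i, (tok, _conf) in toy_model(seq).items():
        seq[i] = tok                      # fill any slots still masked
    return seq
```

Because each step touches many positions at once, the number of model calls grows with the step count rather than the sequence length, which is the source of the claimed throughput advantage.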
Comparative Context and Historical Trends
To contextualize, auto-regressive LLMs have dominated since the rise of models like GPT-3, with revenues and adoption growing rapidly, as seen in Nvidia’s recent Q4 FY25 results with strong AI chip demand. Diffusion LLMs, while newer, build on the success of diffusion models in image generation, like Stable Diffusion, suggesting a potential transition of technology. The table below compares key attributes:
| Attribute | Auto-Regressive LLMs | Diffusion LLMs |
|---|---|---|
| Generation Method | Sequential, token by token | Parallel, coarse-to-fine |
| Speed | Slower, ~100 tokens/sec | Faster, >1000 tokens/sec |
| Efficiency | Higher computational cost | Lower cost, claimed up to 10x savings |
| Controllability | Limited | Enhanced, supports error correction |
| Scalability | Well-established | Emerging, needs validation |
This comparison highlights the potential for diffusion LLMs to disrupt, but their success depends on overcoming current limitations.
Insights From Leading Researchers
__Andrej Karpathy__ wrote on X/Twitter today:
“This is interesting as a first large diffusion-based LLM.
Most of the LLMs you’ve been seeing are ~clones as far as the core modeling approach goes. They’re all trained “autoregressively”, i.e. predicting tokens from left to right. Diffusion is different – it doesn’t go left to right, but all at once. You start with noise and gradually denoise into a token stream.
Most of the image / video generation AI tools actually work this way and use Diffusion, not Autoregression. It’s only text (and sometimes audio!) that have resisted. So it’s been a bit of a mystery to me and many others why, for some reason, text prefers Autoregression, but images/videos prefer Diffusion. This turns out to be a fairly deep rabbit hole that has to do with the distribution of information and noise and our own perception of them, in these domains. If you look close enough, a lot of interesting connections emerge between the two as well.
All that to say that this model has the potential to be different, and possibly showcase new, unique psychology, or new strengths and weaknesses. I encourage people to try it out!”
Andrew Ng said:
“Transformers have dominated LLM text generation, and generate tokens sequentially. This is a cool attempt to explore diffusion models as an alternative, by generating the entire text at the same time using a coarse-to-fine process.”
Future Implications Compared to Auto-Regressive Technology
The emergence of diffusion LLMs, exemplified by Mercury Coder and LLaDA, portends significant changes for the future of language modeling compared to the dominant auto-regressive technology. Auto-regressive models, powering mainstream LLMs like ChatGPT and Claude, generate text sequentially, which can lead to high inference costs and latency, especially for complex tasks.
Diffusion LLMs, with their parallel generation capabilities, offer a potential paradigm shift.
Key potential advantages include:
- Speed and Efficiency: Mercury Coder's claimed 1000+ tokens/second is 5-10x faster than competitors, suggesting diffusion LLMs could significantly reduce latency, making them ideal for real-time applications like chatbots and coding assistants.
- Quality and Controllability: The ability to refine outputs and generate tokens in any order could lead to fewer hallucinations and better alignment with user objectives, as noted by Inception Labs. LLaDA’s competitive performance in instruction-following and addressing the reversal curse further supports this.
- New Capabilities: Diffusion LLMs might enable advanced reasoning and agentic applications, leveraging error correction and parallel processing, which could open new use cases not feasible with auto-regressive models.
However, challenges remain, including training complexity, scalability to very large models, and interpretability, all of which could affect adoption. The evidence leans toward diffusion LLMs coexisting with auto-regressive models, each suited to different tasks, but their long-term impact is still emerging. There is also some controversy over whether diffusion LLMs can scale as effectively as auto-regressive models and handle diverse language tasks, echoing debates around competing models like DeepSeek, which claims strong efficiency with less compute.
Conclusion
Diffusion LLMs, with Mercury Coder as a pioneering commercial example, represent a promising advancement in language modeling, offering speed, efficiency, and new capabilities compared to auto-regressive technology. While their full impact is still unfolding, they could challenge the status quo, potentially coexisting with, or even replacing, current models.
Experts like Karpathy and Ng suggest a future where diffusion LLMs play a significant role, though further research is needed to validate their scalability and performance.