A new research paper from Apple details a technique that speeds up large language model responses, while preserving output quality. Here are the details.
The nerdy bits
Traditionally, LLMs generate text one token at a time. This is slow because each step depends on all the previous ones to keep the output coherent and accurate.
If the model is writing a sentence like “The cat is black”, it predicts each token in sequence. After writing “The cat is”, it looks at everything so far (plus the user’s request, and patterns it learned during training) to calculate the probability of every possible next token in its vocabulary. That’s called autoregression.
In this scenario, it might rank options like black, tall, sleeping, grumpy, fluffy, skinny, purring, white, tired, playing, missing, meowing, cold, and so on, then choose the one that best fits the context.
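To make that concrete, here’s a minimal sketch of that one-token-at-a-time loop in Python. It uses the Hugging Face transformers library and GPT-2 purely as stand-ins for illustration; Apple’s own models and tooling aren’t shown here.

```python
# Minimal greedy autoregressive decoding: one token per forward pass.
# GPT-2 via Hugging Face transformers is just an illustrative stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The cat is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):  # generate 5 more tokens, one at a time
        logits = model(input_ids).logits           # scores for every token in the vocabulary
        next_id = logits[:, -1, :].argmax(dim=-1)  # pick the most likely next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Every new token requires a full pass over everything generated so far, which is exactly the bottleneck Apple’s technique targets.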
What Apple did
In the study “Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential,” Apple’s team found that even though these models are usually trained to predict just the next token, they still carry useful information about several upcoming tokens.
Building on that, they developed a “multi-token prediction” (MTP) framework that lets the model produce multiple tokens at once.
If this sounds a bit like the diffusion model study we covered a few weeks ago, you’re not that far off. While the training process and the underlying technologies differ, both approaches aim to speed up inference and reach the final result faster than the one-token-at-a-time approach.
In this particular study, the researchers inserted special “mask” tokens into prompts, which are basically placeholders for upcoming words.
For example, “The cat is <MASK1> <MASK2>” might get filled in as “very fluffy” in a single step. As it writes, the model speculates on several upcoming words at once, and each guess is immediately verified against what standard autoregressive decoding would have produced. If a guess doesn’t pass the check, the model reverts to the regular one-at-a-time process. All in all, this ensures extra speed without sacrificing accuracy.
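The paper’s exact decoding algorithm isn’t reproduced here, but the speculate-then-verify idea can be sketched with toy stand-in functions. Everything below (next_token, draft_tokens, the tiny “vocabulary”) is a hypothetical placeholder for illustration, not Apple’s code.

```python
# Conceptual sketch of speculate-then-verify decoding (illustrative only).
from typing import List

def next_token(context: List[str]) -> str:
    """Stand-in for ordinary autoregressive decoding: one token per step."""
    lookup = {
        "The cat is": "very",
        "The cat is very": "fluffy",
        "The cat is very fluffy": ".",
    }
    return lookup.get(" ".join(context), "<eos>")

def draft_tokens(context: List[str], k: int) -> List[str]:
    """Stand-in for the multi-token (MTP) guess produced in a single step."""
    return ["very", "sleepy"]  # the second guess is deliberately wrong

def speculate_and_verify(context: List[str], k: int = 2) -> List[str]:
    accepted = []
    for guess in draft_tokens(context, k):
        # A speculated token only counts if standard decoding agrees with it.
        if guess == next_token(context + accepted):
            accepted.append(guess)
        else:
            break  # mismatch: fall back to the usual one-at-a-time process
    return accepted

print(speculate_and_verify(["The", "cat", "is"]))  # -> ['very']
```

Because every accepted token matches what the slower process would have produced anyway, the output is identical; the speedup comes from how often several guesses in a row pass the check.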
In testing with the open-source Tulu3-8B model, Apple trained it to speculatively predict 8 additional tokens, and reported average speedups of 2–3× across general tasks like Q&A and chat, and up to 5× for more predictable domains like coding and math. The gains came with “no degradation in generation quality, thanks to a simple yet effective technique we call gated LoRA adaptation.”
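The article doesn’t unpack what gated LoRA adaptation involves, but as a rough illustration of the general idea (a frozen base layer plus a low-rank update that a gate can switch off for ordinary tokens), here’s a sketch in PyTorch. Every name and detail below is an assumption for illustration, not Apple’s implementation.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Illustrative sketch: a frozen linear layer plus a low-rank (LoRA) update
    controlled by a gate. Assumptions for illustration, not the paper's code."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # original weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # `gate` marks which positions (e.g. the speculative mask tokens)
        # receive the low-rank update; gated-off positions pass through the
        # frozen base layer unchanged.
        return self.base(x) + gate * self.lora_b(self.lora_a(x))
```

If only the speculative positions are gated on, the base model’s regular next-token behavior is left untouched, which lines up with the “no degradation in generation quality” claim.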
You can read the full paper on arXiv.