Apple researchers taught an LLM to predict tokens up to 5x faster – 9to5Mac

News Room · Published 9 August 2025

A new research paper from Apple details a technique that speeds up large language model responses while preserving output quality. Here are the details.

The nerdy bits

Traditionally, LLMs generate text one token at a time. This is slow because each step depends on all the previous ones to keep the output coherent and accurate.

If the model is writing a sentence like “The cat is black”, it predicts each token in sequence. After writing “The cat is”, it looks at everything so far (plus the user’s request, and patterns it learned during training) to calculate the probability of every possible next token in its vocabulary. That’s called autoregression.

In this scenario, it might rank options like black, tall, sleeping, grumpy, fluffy, skinny, purring, white, tired, playing, missing, meowing, cold, and so on, then choose the one that best fits the context.
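
To make the one-token-at-a-time loop concrete, here is a minimal greedy-decoding sketch in Python. The small open gpt2 checkpoint is purely a stand-in for any autoregressive LLM, and the four-token budget is an illustrative assumption, not a detail from Apple’s paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is a stand-in for any autoregressive LLM (an illustrative
# assumption; Apple's experiments use Tulu3-8B).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The cat is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(4):                    # one token per model call
        logits = model(input_ids).logits  # scores over the whole vocabulary
        next_id = logits[0, -1].argmax()  # greedy: take the most likely token
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))     # "The cat is" plus four new tokens
```

Each pass through the loop requires a full forward pass over the model just to emit a single token, which is exactly the bottleneck the Apple work targets.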

What Apple did

In the study “Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential”, Apple’s team found that even though these models are usually trained to predict just the next token, they still carry useful information about several upcoming tokens.

Building on that, they developed a “multi-token prediction” (MTP) framework that lets the model produce multiple tokens at once.

If this sounds a bit like the diffusion model study we covered a few weeks ago, you’re not that far off. While the training process and the underlying technologies differ, both approaches aim to speed up inference and reach the result faster than the one-token-at-a-time approach allows.

In this particular study, the researchers inserted special “mask” tokens into prompts, which are basically placeholders for upcoming words.

For example, “The cat is <MASK1> <MASK2>” might get filled in as “very fluffy” in a single step. As it writes, the model speculates on several upcoming words at once, and each guess is immediately verified against what standard autoregressive decoding would have produced. If a guess doesn’t pass the check, the model reverts to the regular one-at-a-time process. All in all, this ensures extra speed without sacrificing accuracy.
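
The speculate-then-verify loop can be sketched as follows. This is a hedged illustration of the general idea, mirroring standard greedy speculative-decoding verification rather than Apple’s exact algorithm; `model` and `mask_id` are assumed placeholders, and a real multi-token model would need the paper’s training to fill masks meaningfully.

```python
import torch

def mtp_step(model, input_ids, mask_id, num_masks=2):
    """One speculate-and-verify step (illustrative sketch, not the
    paper's exact method). Assumes `model(ids).logits` returns
    next-token scores for every position and that the model was
    trained to fill <MASK> placeholders."""
    # 1. Speculate: append mask placeholders and fill them all in one pass.
    masks = torch.full((1, num_masks), mask_id, dtype=torch.long)
    draft_logits = model(torch.cat([input_ids, masks], dim=1)).logits
    guesses = draft_logits[0, -num_masks:].argmax(dim=-1)

    # 2. Verify in a single pass: keep each guess only while it matches
    #    what ordinary one-token-at-a-time decoding would have produced.
    full = torch.cat([input_ids, guesses.view(1, -1)], dim=1)
    check_logits = model(full).logits
    n_ctx = input_ids.shape[1]
    accepted = 0
    for i, g in enumerate(guesses):
        # Logits at position n_ctx - 1 + i predict the token at n_ctx + i.
        if check_logits[0, n_ctx - 1 + i].argmax() != g:
            break  # mismatch: fall back to regular decoding from here
        accepted += 1
    new_ids = torch.cat([input_ids, guesses[:accepted].view(1, -1)], dim=1)
    return new_ids, accepted
```

Because every draft token either matches what baseline decoding would pick or gets discarded, the output stays token-for-token identical to ordinary generation; the speedup comes from accepting several tokens per verification pass instead of one per model call.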

In testing with the open-source Tulu3-8B model, Apple trained the model to speculatively predict 8 additional tokens, and reported average speedups of 2–3× across general tasks like Q&A and chat, and up to 5× for more predictable domains like coding and math. The gains came with “no degradation in generation quality, thanks to a simple yet effective technique we call gated LoRA adaptation.”
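
The article only names gated LoRA in passing, but the name suggests a low-rank adapter whose update is switched on per token, so ordinary next-token computations pass through the frozen base weights untouched. Below is a hedged sketch of that reading; the class, shapes, and gating rule are all assumptions for illustration, not the paper’s actual design.

```python
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """Hedged sketch of a 'gated LoRA' layer: the low-rank update is
    applied only where a per-token gate is on (e.g. at <MASK>
    positions), leaving the base model's outputs for ordinary tokens
    unchanged. Names and details are assumptions, not the paper's."""

    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base  # frozen pretrained projection
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.02)  # B stays zero: no-op at init

    def forward(self, x, gate):
        # x: (batch, seq, in_features); gate: (batch, seq, 1) of 0./1.
        delta = (x @ self.A.T) @ self.B.T   # low-rank LoRA update
        return self.base(x) + gate * delta  # applied only where gate == 1
```

If the adapter only ever fires on the speculative positions, the base model’s behavior on regular tokens is preserved exactly, which is one plausible reading of how the quoted “no degradation in generation quality” is achieved.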

You can read the full paper on arXiv.
