Unleashing LLM Speed: Multi-Token Self-Speculative Decoding Redefines Inference

Last updated: 2025/07/20 at 1:10 PM

News Room Published 20 July 2025

Table of Links

Abstract and 1. Introduction

2. Method

3. Experiments on real data

4. Ablations on synthetic data

5. Why does it work? Some speculation

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

A. Additional results on self-speculative decoding

Table S3: Relative speedups with self-speculative decoding with byte-level models on code. We prompt the 7B parameter models from Section 3.3 on 4096 sequences of 1024 bytes of code not seen during training, and generate completions consisting of 1024 bytes using greedy self-speculative decoding (Stern et al., 2018) as in Table S2. The speedup was evaluated at a batch size of 16.
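The greedy self-speculative decoding scheme named in the caption can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_multi_head_model`, its counting rule, and the choice of 4 heads are all hypothetical stand-ins for a real multi-token prediction model. The sketch drafts several bytes from the extra heads, keeps only the drafts that the next-token head confirms, and counts forward passes as a rough proxy for decoding cost.

```python
from typing import Callable, List, Sequence, Tuple

VOCAB = 256  # byte-level vocabulary

def toy_multi_head_model(prefix: Sequence[int], n_heads: int = 4) -> List[int]:
    """Hypothetical stand-in for one forward pass of a multi-token model.

    Head i guesses the byte i+1 positions ahead. The "true" sequence simply
    counts upward; extra heads (i >= 1) are deliberately wrong whenever the
    correct byte is a multiple of 5, so that some drafts get rejected.
    """
    preds = []
    for i in range(n_heads):
        correct = (prefix[-1] + i + 1) % VOCAB
        wrong = i >= 1 and correct % 5 == 0
        preds.append((correct + 1) % VOCAB if wrong else correct)
    return preds

Model = Callable[[Sequence[int]], List[int]]

def greedy_decode(model: Model, prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Baseline: one forward pass per generated token, using head 0 only."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens)[0])
    return tokens

def self_speculative_decode(
    model: Model, prompt: Sequence[int], max_new_tokens: int
) -> Tuple[List[int], int]:
    """Greedy self-speculative decoding sketch; returns (tokens, forward passes)."""
    tokens = list(prompt)
    draft = model(tokens)  # all heads' guesses after the prompt
    passes = 1
    produced = 0
    while produced < max_new_tokens:
        # One verification pass. A real transformer would score every
        # extended prefix in a single batched forward pass; the toy model
        # is simply called once per position here.
        passes += 1
        preds = [model(tokens + draft[: i + 1]) for i in range(len(draft))]
        accepted = [draft[0]]  # head 0 is the greedy next token, always kept
        k = 0
        # Keep a drafted token only if the next-token head agrees with it.
        while k + 1 < len(draft) and preds[k][0] == draft[k + 1]:
            accepted.append(draft[k + 1])
            k += 1
        tokens.extend(accepted)
        produced += len(accepted)
        draft = preds[k]  # reuse the predictions made after the last accepted token
    return tokens[: len(prompt) + max_new_tokens], passes

if __name__ == "__main__":
    out, passes = self_speculative_decode(toy_multi_head_model, [1, 2, 3], 20)
    assert out == greedy_decode(toy_multi_head_model, [1, 2, 3], 20)
    print(f"{passes} forward passes for 20 new bytes")
```

Because a draft is accepted only when the next-token head would have produced it anyway, the output matches plain greedy decoding. Any gain comes from accepting more than one byte per verification pass, which depends on how often the extra heads are right.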

:::info
Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech and Equal contribution;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay and Equal contribution;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and a last author;

(5) Gabriel Synnaeve, FAIR at Meta and a last author.

:::

:::info
This paper is available on arXiv under the CC BY 4.0 DEED license.

:::
