Google’s TurboQuant Compression May Support Faster Inference, Same Accuracy on Less Capable Hardware

News Room
Published 15 April 2026 | Last updated 15 April 2026, 1:04 PM

Google Research unveiled TurboQuant, a novel quantization algorithm that compresses large language models’ key-value (KV) caches by up to 6x. With 3.5-bit compression, near-zero accuracy loss, and no retraining needed, it allows developers to run massive context windows on significantly more modest hardware than previously required. Early community benchmarks confirm significant efficiency gains.

While the rationale for quantization is straightforward, the difficulty lies in preserving, within a given budget of encoding bits, the accuracy of inference-relevant computations (e.g., inner products, cosine similarity, distances) performed on the compressed data.

The research team claims that TurboQuant can compress the KV cache down to 3.5 bits per value with near-zero accuracy loss. On standard benchmarks like LongBench and Needle in a Haystack, a 3.5-bit TurboQuant implementation matched the performance of full 16-bit precision across Gemma and Mistral models.

TurboQuant uses a two-step approach. First, data vectors are rotated with a randomized Hadamard transform. This preserves key Euclidean properties (e.g., distances) while spreading out the values, removing the outlier-heavy coordinate distribution that makes low-bit quantization difficult. Post-transform, the vector coordinates follow a beta distribution that is more amenable to compression with low distortion. Second, a decade-old technique, the Quantized Johnson-Lindenstrauss (QJL) transform, is applied to remove the bias created by the first step. Post-QJL, the paper argues, inner products between quantized vectors are unbiased, computationally efficient, and accurate estimators of the inner products between the original, unquantized vectors, which in turn maintains inference accuracy.
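
The sketch below is a minimal illustration of the rotate-then-quantize idea, not Google’s implementation: it applies a randomized Hadamard rotation before uniform 4-bit quantization and omits the QJL debiasing step entirely. The helper names and the toy outlier distribution are assumptions made for the example.

```python
# Minimal sketch of rotate-then-quantize (not the official TurboQuant code).
# A randomized Hadamard rotation spreads outlier energy across coordinates,
# so a coarse uniform grid loses far less information.
import numpy as np
from scipy.linalg import hadamard

def randomized_hadamard(x, signs):
    """Random sign flips followed by an orthonormal Hadamard rotation.
    The dimension of x must be a power of two."""
    d = x.shape[-1]
    H = hadamard(d) / np.sqrt(d)   # orthonormal, symmetric, self-inverse
    return (x * signs) @ H

def inverse_randomized_hadamard(y, signs):
    """Undo the rotation: apply H again, then undo the sign flips."""
    d = y.shape[-1]
    H = hadamard(d) / np.sqrt(d)
    return (y @ H) * signs

def quantize_uniform(x, bits):
    """Round-trip x through a uniform scalar quantizer with 2**bits levels."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

rng = np.random.default_rng(0)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)

# Toy KV-like vector: a handful of outliers dominate the dynamic range.
v = rng.normal(size=d)
v[:3] *= 100.0

err_raw = np.linalg.norm(v - quantize_uniform(v, bits=4))
v_hat = inverse_randomized_hadamard(
    quantize_uniform(randomized_hadamard(v, signs), bits=4), signs)
err_rotated = np.linalg.norm(v - v_hat)

print(f"4-bit error, raw vector:    {err_raw:.1f}")
print(f"4-bit error, rotated first: {err_rotated:.1f}")
```

In this toy setup the rotated vector has a much tighter dynamic range, so the same 4-bit grid yields a noticeably smaller reconstruction error; the QJL stage that makes inner-product estimates unbiased is not attempted here.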

Early community analysis seems to confirm significant gains, albeit more modest than those reported in the paper. The Two Minute Papers analysis suggests more realistic, “real-world” improvements of 30-40% in memory reduction and processing speed:

Based on the results, we cannot conclude that every AI machine suddenly needs 6 times less RAM. No. That is a bit idealistic and only true for some corner cases. You know when you see an official benchmark of a phone battery or electric car mileage with somewhat idealized conditions? It is a bit like that.

So careful with the media hype. […] We wait for more data and analyze experiments here, to get the highest quality information.

But it’s still good. Really good! It helps most people who run AI systems with very long contexts. When you chuck in a huge PDF document, or a movie, or a huge codebase for an AI to analyze. Yes, you will be able to do that cheaper, with meaningfully less memory. Often a few gigabytes less. And I think that is absolutely amazing news.

A fundamental optimization in LLM inference is the caching of computations that are required repeatedly. This is particularly critical during autoregressive generation, where each newly generated token uses data already computed during the generation of all previous tokens. By caching these Key and Value tensors (the KV cache), the system avoids redundant, computationally expensive passes over the entire sequence history.
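
As a deliberately simplified illustration of that pattern, the following toy single-head decode step computes K and V once per token, appends them to a cache, and attends over everything cached so far; the shapes and helper name are invented for the example and omit batching, multi-head structure, and masking.

```python
# Toy single-head autoregressive decode with a KV cache (illustrative only).
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """Process one new token embedding x_t, reusing cached keys/values."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)       # K and V are computed once per token...
    v_cache.append(x_t @ W_v)       # ...then read from the cache thereafter
    K, V = np.stack(k_cache), np.stack(v_cache)   # (seq_len, d)
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()        # softmax over all cached positions
    return weights @ V              # attention output for the new token

rng = np.random.default_rng(0)
d = 64
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []

for _ in range(5):                  # each step reads, but never recomputes, the cache
    out = decode_step(rng.normal(size=d), W_q, W_k, W_v, k_cache, v_cache)

print(f"cache holds {len(k_cache)} key and {len(v_cache)} value vectors of dim {d}")
```

The lists `k_cache` and `v_cache` are exactly the memory that grows with every generated token, and that growing footprint is what TurboQuant compresses.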

However, the efficiency gains offered by caching come with a significant memory cost that grows linearly with the token sequence length. For LLMs designed with long context windows, the massive VRAM footprint of the cache eventually outweighs the memory required for the model weights themselves.

For example, according to Darshan Fofadiya, AI researcher at Amazon, running a Llama 70B model with a 1M-token context window may require approximately 328 GB of VRAM just for the KV cache. When compared to the 140 GB required to hold the 70B model weights in BF16, the cache becomes the primary barrier to deployment, forcing engineers into costly multi-GPU configurations. Compressed down from 16 to 3.5 bits, the cache then requires 72 GB and fits on a single H100 (80 GB HBM).
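
A back-of-the-envelope calculation reproduces those figures. The architectural parameters below (80 layers, 8 grouped-query KV heads, head dimension 128 for a Llama-70B-class model) are assumptions about the configuration behind the estimate, not numbers stated in the article.

```python
# Rough KV-cache sizing for an assumed Llama-70B-class configuration.
layers, kv_heads, head_dim = 80, 8, 128     # assumed architecture
context_tokens = 1_000_000

def kv_cache_bytes(bits_per_value):
    # Two tensors (K and V) per layer, one vector per token per KV head.
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bits_per_value / 8

gb = 1e9
print(f"BF16 (16-bit) cache:  {kv_cache_bytes(16) / gb:.0f} GB")    # ~328 GB
print(f"TurboQuant (3.5-bit): {kv_cache_bytes(3.5) / gb:.0f} GB")   # ~72 GB
```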

During the inference decoding phase, certain input tokens in a prompt produce KV vectors with magnitudes in the hundreds or thousands, while the majority of tokens have values in the 0-1 range. In LLaMA-2-7B, for instance, the top 1% of KV cache values may have magnitudes that are 10-100x larger than the median value. This massive distribution skew makes linear 4-bit quantization impractical without specialized techniques, as the outliers stretch the quantization grid and crush the precision of normal tokens.
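
A tiny numerical example makes the grid-stretching effect visible: the same small-magnitude values are quantized with and without a single large outlier sharing the grid. The specific magnitudes are invented for illustration.

```python
# How a single outlier stretches a uniform 4-bit grid and crushes normal values.
import numpy as np

def quantize_uniform(x, bits=4):
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

rng = np.random.default_rng(1)
normal_vals = rng.uniform(-1.0, 1.0, size=255)    # typical small KV magnitudes
with_outlier = np.append(normal_vals, 500.0)      # one huge activation

err_clean = np.abs(normal_vals - quantize_uniform(normal_vals)).mean()
err_skewed = np.abs(normal_vals - quantize_uniform(with_outlier)[:-1]).mean()

print(f"mean error on normal values, no outlier:   {err_clean:.3f}")
print(f"mean error on normal values, with outlier: {err_skewed:.3f}")
```

With the outlier present, the 16 grid levels span roughly [-1, 500], so nearly all of the ordinary values collapse onto the lowest level.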

Generative inference with LLMs is, for relatively small batch sizes, memory-bound. With memory speeds growing slower than compute speeds, reducing the memory bottleneck (the so-called memory wall) is key to efficient inference. For short contexts, weight matrices are the dominant contributor to memory consumption. For long contexts, the KV cache becomes the main contributor. Quantization techniques for both model weights and the KV cache are thus instrumental in speeding up inference and constitute a major research topic.
