Google’s TurboQuant Compression May Support Faster Inference, Same Accuracy on Less Capable Hardware

News Room
Published 15 April 2026 | Last updated 15 April 2026, 1:04 PM

Google Research unveiled TurboQuant, a novel quantization algorithm that compresses large language models’ key-value (KV) caches by up to 6x. With 3.5-bit compression, near-zero accuracy loss, and no retraining needed, it allows developers to run massive context windows on significantly more modest hardware than previously required. Early community benchmarks confirm significant efficiency gains.

While the rationale for quantization is straightforward, the difficulty lies in preserving, within a given budget of encoding bits, the accuracy of inference-relevant computations (e.g., inner products, cosine similarity, distances) performed on the compressed data.

The research team claims that TurboQuant can compress the KV cache down to 3.5 bits per value with near-zero accuracy loss. On standard benchmarks like LongBench and Needle in a Haystack, a 3.5-bit TurboQuant implementation matched the performance of full 16-bit precision across Gemma and Mistral models.

TurboQuant uses a two-step approach. First, data vectors are rotated with a randomized Hadamard transform. This preserves key Euclidean properties (e.g., distances) while spreading out the values, removing the outlier-heavy coordinate distribution that makes low-bit quantization difficult. Post-transform, the vector coordinates follow a beta distribution that is more amenable to compression with low distortion. Second, a decade-old technique, the Quantized Johnson-Lindenstrauss (QJL) transform, is applied to remove the bias created by the first step. Post-QJL, the paper argues, inner products between quantized vectors are unbiased, computationally efficient, and accurate estimators of the inner products between the original, unquantized vectors, which in turn maintains inference accuracy.
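
The sketch below is a minimal illustration of the rotate-then-quantize idea, not Google’s implementation: it applies a randomized Hadamard rotation before uniform 4-bit quantization and omits the QJL debiasing step entirely. The helper names and the toy outlier distribution are assumptions made for the example.

```python
# Minimal sketch of rotate-then-quantize (not the official TurboQuant code).
# A randomized Hadamard rotation spreads outlier energy across coordinates,
# so a coarse uniform grid loses far less information.
import numpy as np
from scipy.linalg import hadamard

def randomized_hadamard(x, signs):
    """Random sign flips followed by an orthonormal Hadamard rotation.
    The dimension of x must be a power of two."""
    d = x.shape[-1]
    H = hadamard(d) / np.sqrt(d)   # orthonormal, symmetric, self-inverse
    return (x * signs) @ H

def inverse_randomized_hadamard(y, signs):
    """Undo the rotation: apply H again, then undo the sign flips."""
    d = y.shape[-1]
    H = hadamard(d) / np.sqrt(d)
    return (y @ H) * signs

def quantize_uniform(x, bits):
    """Round-trip x through a uniform scalar quantizer with 2**bits levels."""
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

rng = np.random.default_rng(0)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)

# Toy KV-like vector: a handful of outliers dominate the dynamic range.
v = rng.normal(size=d)
v[:3] *= 100.0

err_raw = np.linalg.norm(v - quantize_uniform(v, bits=4))
v_hat = inverse_randomized_hadamard(
    quantize_uniform(randomized_hadamard(v, signs), bits=4), signs)
err_rotated = np.linalg.norm(v - v_hat)

print(f"4-bit error, raw vector:    {err_raw:.1f}")
print(f"4-bit error, rotated first: {err_rotated:.1f}")
```

In this toy setup the rotated vector has a much tighter dynamic range, so the same 4-bit grid yields a noticeably smaller reconstruction error; the QJL stage that makes inner-product estimates unbiased is not attempted here.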

Early community analysis seems to confirm significant gains, albeit more modest than those reported in the paper. The Two Minute Papers analysis suggests more realistic, “real-world” improvements of 30-40% in memory reduction and processing speed:

Based on the results, we cannot conclude that every AI machine suddenly needs 6 times less RAM. No. That is a bit idealistic and only true for some corner cases. You know when you see an official benchmark of a phone battery or electric car mileage with somewhat idealized conditions? It is a bit like that.

So careful with the media hype. […] We wait for more data and analyze experiments here, to get the highest quality information.

But it’s still good. Really good! It helps most people who run AI systems with very long contexts. When you chuck in a huge PDF document, or a movie, or a huge codebase for an AI to analyze. Yes, you will be able to do that cheaper, with meaningfully less memory. Often a few gigabytes less. And I think that is absolutely amazing news.

A fundamental optimization in LLM inference is the caching of computations that are required repeatedly. This is particularly critical during autoregressive generation, where each newly generated token uses data already computed during the generation of all previous tokens. By caching these Key and Value tensors (the KV cache), the system avoids redundant, computationally expensive passes over the entire sequence history.
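
As a deliberately simplified illustration of that pattern, the following toy single-head decode step computes K and V once per token, appends them to a cache, and attends over everything cached so far; the shapes and helper name are invented for the example and omit batching, multi-head structure, and masking.

```python
# Toy single-head autoregressive decode with a KV cache (illustrative only).
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """Process one new token embedding x_t, reusing cached keys/values."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)       # K and V are computed once per token...
    v_cache.append(x_t @ W_v)       # ...then read from the cache thereafter
    K, V = np.stack(k_cache), np.stack(v_cache)   # (seq_len, d)
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()        # softmax over all cached positions
    return weights @ V              # attention output for the new token

rng = np.random.default_rng(0)
d = 64
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []

for _ in range(5):                  # each step reads, but never recomputes, the cache
    out = decode_step(rng.normal(size=d), W_q, W_k, W_v, k_cache, v_cache)

print(f"cache holds {len(k_cache)} key and {len(v_cache)} value vectors of dim {d}")
```

The lists `k_cache` and `v_cache` are exactly the memory that grows with every generated token, and that growing footprint is what TurboQuant compresses.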

However, the efficiency gains offered by caching come with a significant memory cost that grows linearly with the token sequence length. For LLMs designed with long context windows, the massive VRAM footprint of the cache eventually outweighs the memory required for the model weights themselves.

For example, according to Darshan Fofadiya, AI researcher at Amazon, running a Llama 70B model with a 1M-token context window may require approximately 328 GB of VRAM just for the KV cache. When compared to the 140 GB required to hold the 70B model weights in BF16, the cache becomes the primary barrier to deployment, forcing engineers into costly multi-GPU configurations. Compressed down from 16 to 3.5 bits, the cache then requires 72 GB and fits on a single H100 (80 GB HBM).
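
A back-of-the-envelope calculation reproduces those figures. The architectural parameters below (80 layers, 8 grouped-query KV heads, head dimension 128 for a Llama-70B-class model) are assumptions about the configuration behind the estimate, not numbers stated in the article.

```python
# Rough KV-cache sizing for an assumed Llama-70B-class configuration.
layers, kv_heads, head_dim = 80, 8, 128     # assumed architecture
context_tokens = 1_000_000

def kv_cache_bytes(bits_per_value):
    # Two tensors (K and V) per layer, one vector per token per KV head.
    values = 2 * layers * kv_heads * head_dim * context_tokens
    return values * bits_per_value / 8

gb = 1e9
print(f"BF16 (16-bit) cache:  {kv_cache_bytes(16) / gb:.0f} GB")    # ~328 GB
print(f"TurboQuant (3.5-bit): {kv_cache_bytes(3.5) / gb:.0f} GB")   # ~72 GB
```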

During the inference decoding phase, certain input tokens in a prompt produce KV vectors with magnitudes in the hundreds or thousands, while the majority of tokens have values in the 0-1 range. In LLaMA-2-7B, for instance, the top 1% of KV cache values may have magnitudes that are 10-100x larger than the median value. This massive distribution skew makes linear 4-bit quantization impractical without specialized techniques, as the outliers stretch the quantization grid and crush the precision of normal tokens.
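
A tiny numerical example makes the grid-stretching effect visible: the same small-magnitude values are quantized with and without a single large outlier sharing the grid. The specific magnitudes are invented for illustration.

```python
# How a single outlier stretches a uniform 4-bit grid and crushes normal values.
import numpy as np

def quantize_uniform(x, bits=4):
    levels = 2 ** bits
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    return np.round((x - lo) / step) * step + lo

rng = np.random.default_rng(1)
normal_vals = rng.uniform(-1.0, 1.0, size=255)    # typical small KV magnitudes
with_outlier = np.append(normal_vals, 500.0)      # one huge activation

err_clean = np.abs(normal_vals - quantize_uniform(normal_vals)).mean()
err_skewed = np.abs(normal_vals - quantize_uniform(with_outlier)[:-1]).mean()

print(f"mean error on normal values, no outlier:   {err_clean:.3f}")
print(f"mean error on normal values, with outlier: {err_skewed:.3f}")
```

With the outlier present, the 16 grid levels span roughly [-1, 500], so nearly all of the ordinary values collapse onto the lowest level.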

Generative inference with LLMs is, for relatively small batch sizes, memory-bound. With memory speeds growing slower than compute speeds, reducing the memory bottleneck (the so-called memory wall) is key to efficient inference. For short contexts, weight matrices are the dominant contributor to memory consumption. For long contexts, the KV cache becomes the main contributor. Quantization techniques for both model weights and the KV cache are thus instrumental in speeding up inference and constitute a major research topic.
