Google published a compression algorithm that shrinks AI memory usage by 6x with zero accuracy loss. Memory chip stocks immediately dropped up to 6%. And the deeper you look at it, the more significant it gets.
WHAT THE PROBLEM WAS
Running a large language model requires storing something called a KV cache - a "digital cheat sheet" holding the attention keys and values already computed for earlier tokens, so the model does not have to recalculate them for every new token.
The KV cache is one of the biggest memory bottlenecks in AI inference. As models get longer context windows (meaning they can process more text at once), the KV cache gets proportionally larger. This has been a hard ceiling on what you can run, how fast, and at what cost.
Standard KV cache values are stored at 16 bits of precision. Sometimes 32. Every value, across every layer, for every token in the context. For a frontier model processing a 128,000 token context window, this is an enormous amount of memory.
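To make the scale concrete, here is a back-of-the-envelope sizing in Python. The model shape (32 layers, 8 KV heads, head dimension 128, roughly Llama-3.1-8B's grouped-query layout) is an illustrative assumption, not a figure from the announcement:

```python
# Back-of-the-envelope KV cache sizing (illustrative model shape).
N_LAYERS = 32      # transformer layers (assumed, ~Llama-3.1-8B)
N_KV_HEADS = 8     # grouped-query attention KV heads (assumed)
HEAD_DIM = 128     # dimension per head (assumed)
BYTES_FP16 = 2     # 16-bit values

# Both keys AND values are cached, hence the factor of 2.
bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_FP16

context = 128_000  # tokens in the context window
total_gb = bytes_per_token * context / 1e9

print(f"{bytes_per_token} bytes/token -> {total_gb:.1f} GB for a 128K context")
```

On this shape, a single 128K-token request holds roughly 17 GB of KV cache at 16-bit precision - before the model weights themselves take a single byte.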
WHAT TURBOQUANT IS
Google Research announced TurboQuant on March 24, 2026. It will be presented at ICLR 2026.
TurboQuant compresses KV cache values from 16 bits down to 3 bits per value - with zero accuracy loss. No model retraining required. No fine-tuning. Plug it in, it works.
How it achieves this:
Two algorithms working together:
PolarQuant - converts data vectors to polar coordinates before compression. This geometric trick simplifies the data structure enough that standard quantization can be applied without the usual normalization overhead that eats up extra bits. This stage handles most of the compression.
QJL (Quantized Johnson-Lindenstrauss) - spends just 1 additional bit to encode the residual error PolarQuant introduces. This step removes the bias that quantization would otherwise add and preserves the inner products between vectors that models depend on for accurate attention scores.
The result: 3-bit compression with no meaningful accuracy degradation on downstream tasks including question answering, code generation, and summarization.
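As a toy illustration of the polar-coordinate idea - emphatically not Google's actual algorithm, just a hand-rolled 2-D sketch with an assumed 3-bit angle grid - quantizing a vector's angle instead of its raw components preserves the magnitude exactly and bounds the direction error:

```python
import math

def quantize_polar(vec, angle_bits=3):
    """Toy 2-D polar quantizer: keep the magnitude, quantize the angle."""
    x, y = vec
    r = math.hypot(x, y)              # magnitude survives exactly
    theta = math.atan2(y, x)          # angle in [-pi, pi]
    levels = 2 ** angle_bits          # 8 angle buckets at 3 bits
    step = 2 * math.pi / levels
    q = round(theta / step) % levels  # nearest bucket index
    return r, q

def dequantize_polar(r, q, angle_bits=3):
    """Reconstruct a vector from magnitude + quantized angle bucket."""
    step = 2 * math.pi / (2 ** angle_bits)
    theta = q * step
    if theta > math.pi:               # wrap back into (-pi, pi]
        theta -= 2 * math.pi
    return (r * math.cos(theta), r * math.sin(theta))

r, q = quantize_polar((3.0, 4.0))
print(dequantize_polar(r, q))         # close in direction, exact in magnitude
```

With 8 angle levels the worst-case direction error is half a step (~22.5 degrees). The real method works in high dimensions, and the QJL bit then debiases whatever error this stage leaves behind.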
THE ACTUAL NUMBERS
Tested on Gemma, Mistral, and Llama-3.1-8B on Nvidia H100 GPUs:
6x average reduction in KV cache memory
8x speedup in computing attention logits
Zero accuracy loss across benchmarks including Needle-in-a-Haystack (long context retrieval)
50%+ cost savings estimated for enterprises running LLMs at scale
The 8x speedup on attention computation matters because attention is one of the most compute-intensive operations in a transformer model. Faster attention means faster inference, which means cheaper API calls, more requests per second per GPU, and the ability to run longer contexts without hitting hardware limits.
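One hedged way to see why: at decode time the GPU must stream the cached keys past each new query, so attention-logit speed is roughly set by bytes moved. Under that memory-bound assumption (all shapes illustrative, not from the paper), the bit-depth ratio alone predicts a ~5.3x gain; the reported 8x presumably layers kernel-level optimizations on top:

```python
# Bytes of cached keys streamed per decoded token (illustrative shape).
SEQ_LEN = 128_000                              # context length
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128    # assumed model shape
HBM_TBPS = 3.35                                # H100 HBM3 bandwidth, TB/s

def key_bytes(bits):
    # Every decoded token re-reads the full key cache to form attention logits.
    return SEQ_LEN * N_LAYERS * N_KV_HEADS * HEAD_DIM * bits / 8

ms_per_token = key_bytes(16) / (HBM_TBPS * 1e12) * 1e3
speedup = key_bytes(16) / key_bytes(3)
print(f"fp16 key reads cost ~{ms_per_token:.1f} ms/token; "
      f"bit depth alone buys {speedup:.2f}x")
```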
WHAT IT MEANS PRACTICALLY
For inference costs: Running frontier models gets 50%+ cheaper per query. This is not a marginal improvement. It applies to every API call, every chatbot response, every agentic workflow.
For context windows: Models that previously could not sustain 128K or 256K context windows due to memory constraints can now run longer contexts on the same hardware. This enables new use cases: analyzing entire codebases in one shot, processing full legal documents, maintaining longer conversation histories.
For hardware: The same GPU that was previously maxed out running one large model can now run significantly more, or run bigger models. This is a direct efficiency multiplier on existing Nvidia infrastructure.
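A rough sketch of that multiplier, with every number assumed for illustration (80 GB H100, ~16 GB of fp16 weights for an 8B model, ~17 GB of fp16 KV cache per 128K-token request):

```python
# How many concurrent 128K-token requests fit on one GPU? (all numbers illustrative)
GPU_MEM_GB = 80.0        # H100 HBM capacity
WEIGHTS_GB = 16.0        # ~8B parameters at fp16 (assumed)
KV_FP16_GB = 17.0        # KV cache per 128K-token request at 16-bit (assumed)

free = GPU_MEM_GB - WEIGHTS_GB
before = int(free / KV_FP16_GB)        # 16-bit KV cache
after = int(free / (KV_FP16_GB / 6))   # 6x compression

print(f"concurrent requests: {before} before, {after} after")
```

Going from 3 to 22 concurrent long-context requests on the same card is the efficiency multiplier in practice, before counting the attention speedup at all.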
For Google specifically: Gemini's serving costs drop. Google Search's AI inference gets cheaper. The economic advantage of already owning massive H100 infrastructure becomes more leveraged when each GPU does more work per dollar.
THE SILICON VALLEY PIED PIPER COMPARISON
TechCrunch compared TurboQuant to Pied Piper from the HBO show Silicon Valley - the fictional compression algorithm that was supposed to change everything. The comparison is partly a joke and partly accurate. The show's premise was that a breakthrough compression algorithm would upend the entire tech industry. Google just published something that might actually do that.
THE CHIP STOCK CRASH
Memory chip stocks dropped immediately on the announcement:
SK Hynix: -6%
Samsung: -5%
SanDisk: -5.7%
Western Digital: -4.7%
Micron: -3%
The market logic: if AI models need 6x less memory to run, demand for memory chips falls proportionally.
Why analysts think this is an overreaction:
The Jevons Paradox argument is the main counterpoint. Named after the 19th-century economist William Stanley Jevons, it describes what happens when efficiency increases: you don't use less of the resource, you use more, because the lower cost enables new applications that were previously uneconomical.
Applied here: cheaper inference does not reduce memory demand. It enables larger models, longer context windows, and more users running AI at once. All of those require more memory than before, just used more efficiently.
Historical data supports this. Previous quantization improvements did not reduce Nvidia or memory chip procurement. They enabled more ambitious AI deployments. Wells Fargo called the sell-off a buying opportunity.
The analysts are probably right. But the immediate reaction tells you how sensitive the market is to anything that threatens the "AI needs infinite hardware forever" thesis.
THE BIGGER PICTURE
TurboQuant is a software-only improvement. No new chip required. No model retraining. It ships as an algorithm.
This is increasingly how AI progress is happening - not just through scaling up hardware, but through better algorithms that extract more from existing hardware. DeepSeek did this earlier in the year with their training efficiency work. Google is doing it now with inference efficiency.
The implication: the hardware moat that Nvidia has built is being eroded from both ends. Training is getting more efficient (less hardware needed to train equivalent models). Inference is getting more efficient (less hardware needed to serve them). The absolute demand for GPUs continues to grow, but the leverage each AI dollar has on compute is increasing.
For Google, this is particularly valuable. They already own enormous infrastructure. Every efficiency improvement on that infrastructure is pure margin.
KEY NUMBERS
| Metric | Value |
|---|---|
| Compression ratio | 16 bits → 3 bits (~5.3x reduction in bit depth) |
| KV cache memory reduction | 6x average |
| Attention computation speedup | 8x on H100 |
| Accuracy loss | Zero (benchmark-verified) |
| Estimated enterprise cost savings | 50%+ per query |
| Retraining required | None |
| Announcement date | March 24, 2026 |
| Conference | ICLR 2026 |
| SK Hynix stock drop | -6% |
| Samsung stock drop | -5% |
| Micron stock drop | -3% |
Until next time,
Wes “happy that AI also struggles with attention” Roth

