Every few months, an AI paper gets breathless coverage claiming to have “solved” memory constraints, only to quietly fade into the “interesting but not practical” pile. TurboQuant — published by Amir Zandieh and Majid Hadian (Google Research/DeepMind), Majid Daliri (NYU), and Vahab Mirrokni (Google Research) — deserves more careful treatment than that. It’s a genuinely elegant piece of theoretical work with real practical payoff, but only if you understand exactly what it’s solving and where the messy edges are.
The short version: TurboQuant compresses your KV cache to 3.5 bits with quality-neutral results on standard benchmarks, requires zero calibration or fine-tuning, and is provably within a 2.7× constant factor of the information-theoretic lower bound for distortion. Google’s blog rounds this to “3-bit with zero accuracy loss” — a reasonable simplification for a headline, but the paper tables tell a more precise story at 3.5-bit. That last part — the theoretical bound — is the contribution that makes mathematically inclined readers perk up.
Let’s dig into why it works, where it actually helps, and where community implementations have found the limits.
The Problem: KV Cache Is Eating Your Memory
To understand why TurboQuant matters, you need to appreciate how bad the KV cache problem has become. During inference, transformers store key and value vectors for every token in the context window. At FP16, a single 8B model running 512 concurrent users at 32K tokens consumes over 512 GB of KV cache memory — dwarfing the model weight footprint many times over. As context windows push toward 256K tokens (hello, Qwen3.5), this becomes the primary bottleneck, not compute.
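The arithmetic is easy to sanity-check. Below is a minimal sketch using dimensions modeled on Llama-3.1-8B (32 layers, 8 KV heads under GQA, head dim 128, FP16); these shapes are my assumptions for illustration, not numbers taken from the paper:

```python
# Back-of-envelope KV cache sizing. Layer/head/dim values are assumptions
# modeled on Llama-3.1-8B (32 layers, 8 KV heads via GQA, head dim 128).

def kv_cache_bytes(tokens, users, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Total KV cache size: a K and a V vector per token, layer, and KV head."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K + V
    return tokens * users * per_token

total = kv_cache_bytes(tokens=32_768, users=512)
print(f"{total / 2**30:.0f} GiB")  # prints "2048 GiB"
```

At these assumed shapes, 512 users at 32K context is roughly 2 TiB of cache, comfortably "over 512 GB" and far beyond the ~16 GB of FP16 weights.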
Previous quantization attempts like KIVI (ICML 2024) got you to about 2.6× compression with asymmetric 2-bit quantization, and, like TurboQuant, KIVI is tuning-free. That’s genuinely useful, but it still introduces systematic bias in attention score calculations, and the standard per-block normalization constants and scale factors eat back 1–2 bits of the savings you just made. You end up paying overhead to track the overhead.
The Core Idea: Gaussianize, Then Quantize
TurboQuant’s insight is clean and beautiful in the way good algorithms tend to be: make hard vector quantization look easy by first randomizing the geometry, then spending one extra bit to fix the part that matters for attention.
Stage 1: Random Rotation (PolarQuant)
Take a vector x and multiply it by a random orthogonal matrix R. After rotation, something mathematically convenient happens: each coordinate now follows a Beta distribution that concentrates around a Gaussian in high dimensions, and the coordinates become nearly independent. This transforms an inherently hard joint vector quantization problem into many easy independent scalar quantization problems.
Once you’re in that rotated space, you apply Lloyd-Max scalar quantization per coordinate — the MSE-optimal scalar quantizer, which places centroids exactly where probability theory says they should go. This is Algorithm 1 in the paper. Dequantization is simply: look up centroids, reconstruct, multiply by Rᵀ to rotate back.
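A toy sketch of the two steps, with a QR-sampled rotation and Lloyd's algorithm fit on Gaussian samples standing in for the paper's construction (every parameter choice here is illustrative, not paper-faithful):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    """Random orthogonal matrix via QR of a Gaussian matrix (proof-of-concept;
    fast implementations use a randomized Hadamard transform instead)."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so Q is uniformly distributed

def lloyd_max_centroids(bits, n_samples=100_000, iters=30):
    """Approximate MSE-optimal scalar quantizer for a standard Gaussian,
    fit by Lloyd's algorithm on samples (a stand-in for closed-form Lloyd-Max)."""
    x = rng.standard_normal(n_samples)
    c = np.quantile(x, (np.arange(2**bits) + 0.5) / 2**bits)  # init at quantiles
    for _ in range(iters):
        idx = np.abs(x[:, None] - c[None, :]).argmin(axis=1)   # assign
        for j in range(len(c)):
            if np.any(idx == j):
                c[j] = x[idx == j].mean()                      # update
    return c

def quantize(v, R, centroids):
    z = R @ v  # rotated coordinates are approximately i.i.d. Gaussian
    return np.abs(z[:, None] - centroids[None, :]).argmin(axis=1)

def dequantize(codes, R, centroids):
    return R.T @ centroids[codes]  # look up centroids, rotate back

d, bits = 64, 3
R = random_rotation(d)
C = lloyd_max_centroids(bits)
v = rng.standard_normal(d)
v_hat = dequantize(quantize(v, R, C), R, C)
mse = np.mean((v - v_hat) ** 2)
```

At 3 bits the per-coordinate MSE lands near the classic Lloyd-Max figure of ~0.035σ² for a Gaussian source, which is the "easy scalar problem" the rotation buys you.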
The paper calls this first stage PolarQuant (to be presented at AISTATS 2026 separately). It achieves near-optimal MSE distortion rates that scale as 2^(−2b) with bitwidth b — which is essentially the best you can do per Shannon’s rate-distortion theory.
Stage 2: 1-bit QJL Residual Correction
Here’s the wrinkle MSE optimization alone doesn’t solve: pure MSE quantization introduces bias in inner product estimates. In attention, you’re computing — dot products between queries and keys. If your quantized keys systematically under- or over-represent certain directions, your attention scores drift, and that drift accumulates across layers. This is subtle but real.
TurboQuant’s fix is elegant: after the MSE quantization step, compute the residual and store a 1-bit Quantized Johnson-Lindenstrauss (QJL) sketch of it, plus its L2 norm. At dequantization time, add this back. The QJL projection preserves inner product geometry with a sign bit — mathematically it creates an unbiased estimator of the true dot product.
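A sketch of what such a sign-sketch estimator can look like, built on the standard Gaussian identity E[⟨s,q⟩·sign(⟨s,r⟩)] = √(2/π)·⟨q,r⟩/‖r‖. The sketch size m and the dense Gaussian projection here are illustrative choices, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(1)

def qjl_encode(r, m):
    """1-bit sketch of residual r: a Gaussian projection kept only as sign
    bits, plus the residual's L2 norm (assumed form of the QJL idea)."""
    S = rng.standard_normal((m, len(r)))
    return np.sign(S @ r), np.linalg.norm(r), S

def qjl_inner(q, signs, norm, S):
    """Unbiased estimate of <q, r> from the sign sketch: for Gaussian s,
    E[<s,q> * sign(<s,r>)] = sqrt(2/pi) * <q,r> / ||r||."""
    m = len(signs)
    return norm * np.sqrt(np.pi / 2) / m * np.dot(S @ q, signs)

d, m = 64, 50_000           # m is huge only to make the demo estimate tight
r = rng.standard_normal(d)  # stands in for the quantization residual
q = rng.standard_normal(d)  # stands in for a query vector
signs, norm, S = qjl_encode(r, m)
est, true = qjl_inner(q, signs, norm, S), np.dot(q, r)
```

In deployment m is small and the cost is one bit per sketch coordinate; the point of the demo is only that the estimator's expectation matches the true dot product, which is exactly the bias correction attention needs.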
The paper proves in Theorem 2 that the (b+1)-bit inner product variant of TurboQuant achieves near-optimal inner product distortion without any assumptions about the input vectors. That “without assumptions” qualifier is what “data-oblivious” means in the paper’s framing — you don’t need calibration data, domain-specific tuning, or any knowledge of the distribution ahead of time.
The Practical Rotation Trick
One detail matters a lot for practitioners: the dense random-matrix multiply is too slow for inference. In practice, community implementations use a randomized Hadamard transform (WHT + random sign flips), bringing the cost down to O(d log d). The paper notes this approach as an efficient approximation; the dense-matrix version is what the proofs assume.
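A minimal version of that trick: a textbook iterative WHT (the O(d log d) butterfly) plus random sign flips. Real kernels fuse and vectorize this, but the structure is similar:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform, O(d log d); length must be a power of 2."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i+h].copy(), x[i+h:i+2*h].copy()
            x[i:i+h], x[i+h:i+2*h] = a + b, a - b  # butterfly step
        h *= 2
    return x / np.sqrt(len(x))  # normalize so the transform is orthogonal

def randomized_hadamard(x, signs):
    """Random sign flips, then WHT: a cheap stand-in for a dense rotation."""
    return fwht(x * signs)

rng = np.random.default_rng(2)
d = 128
signs = rng.choice([-1.0, 1.0], size=d)
x = rng.standard_normal(d)
y = randomized_hadamard(x, signs)
```

Because the normalized transform is orthogonal, norms are preserved and the whole thing is exactly invertible: apply `fwht` again and undo the sign flips.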
What the Paper Actually Claims (Numbers That Matter)
Before getting to community results, here are the numbers straight from Google’s evaluation, which used Llama-3.1-8B-Instruct, Gemma, and Mistral across LongBench, Needle-in-a-Haystack (NIAH), ZeroSCROLLS, RULER, and L-Eval:
- LongBench (Llama-3.1-8B-Instruct): TurboQuant at 3.5-bit matched full-cache performance exactly — 50.06 vs 50.06. At 2.5-bit, it dropped only slightly: 49.44 on Llama-3.1-8B and 49.62 on Ministral-7B. For context, KIVI degraded noticeably at these compression levels.
- Needle-in-a-Haystack: TurboQuant and full-precision both score 0.997 in the paper’s figure at up to 104K token context under 4× compression — the benchmark effectively treats them as identical. For comparison, KIVI scores 0.981 and PolarQuant alone scores 0.995. The residual QJL correction is doing real work here.
- Attention logit speedup on H100: 4-bit TurboQuant delivers up to 8× faster attention logit computation compared to unquantized 32-bit keys. This is a kernel-level result on H100 with Google’s JAX implementation as baseline — not an end-to-end inference number.
- Vector search (GloVe dataset): TurboQuant achieves the highest 1@k recall ratio against Product Quantization and RabitQ baselines while using no dataset-specific tuning. Indexing time drops to effectively zero (0.0013 seconds for 1,536-dimensional vectors vs. 239.75 seconds for Product Quantization).
The memory reduction figure depends on the bitwidth setting: the paper reports at least 4.5× in its LongBench setup, and the ~6× headline figure reflects more aggressive 2.5-bit configurations that also eliminate per-block quantization constant overhead — something most prior methods can’t avoid.
One honest caveat: Google’s benchmark models top out at roughly 8B parameters. Whether identical guarantees hold at 70B or 400B scale remains an open question the paper does not answer.
The Qwen3.5 Reality Check: Hybrid Architecture Complicates Things
Qwen3.5 has become the de facto benchmark target for community TurboQuant implementations, and it’s a useful stress test because it’s a hybrid architecture. This matters more than most coverage acknowledges.
Qwen3.5 mixes full standard attention layers with linear attention and Mamba-style recurrent layers. Only the full-attention layers have a KV cache that TurboQuant applies to. In Qwen3.5-27B, that’s 16 full-attention layers out of 64 total. In the 40-layer variants, roughly 10 layers qualify. This means end-to-end savings are structurally capped — you’re not compressing 100% of the cache, you’re compressing 25–40% of it.
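You can sketch that cap with Amdahl-style arithmetic. The splits below are hypothetical, chosen only to bracket the Qwen3.5 numbers above:

```python
def end_to_end_ratio(compressible_fraction, per_layer_ratio):
    """Blended cache compression when only part of the cache footprint is
    compressible: the untouched bytes set a hard ceiling (Amdahl-style)."""
    remaining = (1 - compressible_fraction) + compressible_fraction / per_layer_ratio
    return 1 / remaining

# Hypothetical splits at a ~2.6x per-layer ratio:
print(round(end_to_end_ratio(0.25, 2.6), 2))  # → 1.18 (25% of cache bytes compressible)
print(round(end_to_end_ratio(0.95, 2.6), 2))  # → 2.41 (cache dominated by full-attention layers)
```

The two cases bound what you should expect: if the non-full-attention layers hold substantial state, blended savings stay modest; if the growing KV cache lives almost entirely in the full-attention layers, you approach the per-layer ratio, which is consistent with the capacity-doubling result below.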
With that context, here’s what the major community implementations have actually found:
0xSero / vLLM / Qwen3.5-27B (CUDA)
The most concrete real-world result. Using 3-bit keys and 2-bit group quantization for values with Triton kernels (the hardware config has evolved across RTX 3090 and RTX 5090 setups; see the live README for the current one):
- Max token capacity doubled: 457,072 → 914,144 tokens
- ~30 GB of KV cache freed across the GPU cluster
- Output described as “identical for the same prompt”
- Per full-attention layer compression ratio: ~2.6× (198 bytes/token vs. 512 bytes/token at bf16)
- Note: the headline 6× figure doesn’t appear here because only 16/64 layers benefit
This is the result that most directly translates to “I can serve more concurrent users” — a real, measurable win.
unixsysdev / llama.cpp fork / Qwen3.5-0.8B (ROCm/Radeon)
A CPU-focused implementation using per-block Walsh-Hadamard transforms. Unlike some others, it does implement QJL residual sign correction, though the README notes this is an approximation rather than a strict paper-faithful reproduction:
- 4.6× K-cache compression achieved
- Perplexity degradation: 4.6–5.8% depending on configuration
- Throughput impact vs FP16: −3.9% prefill, −2.1% decode
Worth noting: 0.8B is below the size range where community implementations agree results are reliable. The author explicitly flags this. Take the perplexity numbers with that grain of salt.
TheTom / Metal / Qwen3.5-35B-A3B (Apple Silicon M5 Max)
The most polished Metal implementation, including an additional sparse-V optimization beyond the paper:
- Prefill speed: 2,747 tok/s with TurboQuant vs 2,694 tok/s for q8_0 — essentially parity
- 4.6× compression ratio
- Perplexity: 5.445 vs 5.414 for q8_0 — about 1% PPL degradation
- Needle-in-a-Haystack: 9/9 single-needle retrieval — better than their q8_0 baseline
The 9/9 NIAH result is the standout here. The sparse-V dequant optimization is extra engineering beyond paper-faithful TurboQuant, but the result holds up. This is the implementation that comes closest to the paper’s claims in a real-deployment scenario.
flovflo / MLX / Qwen3.5 target model (Apple Silicon)
On a 2,048 prompt / 8 generation token benchmark:
- +32.0% prompt throughput
- +25.7% decode throughput
- −26.0% wall time
- −43.7% KV cache size
These are impressive numbers, but the author is unusually candid: this implementation uses MLX affine quantization rather than the paper’s rotation approach, and the residual correction is not a faithful QJL estimator. It explicitly calls itself “TurboQuant-inspired.” The performance gains are real, but attributing them specifically to TurboQuant’s theoretical properties would be a stretch.
zapabob / CUDA / Qwen3.5-9B Replay Audit
The most skeptical community voice, and worth taking seriously. Their audit on captured Qwen3.5-9B replay found that:
- full_kv compression (keys + values) hurt V-dependent hidden-state transport before it meaningfully hurt logit scores
- key_only_random preserved hidden geometry better than full_kv
- 2.5-bit and 3.5-bit remain useful Pareto points on the precision/compression curve, but the “no quality loss in general runtimes” messaging oversimplifies the picture
This matches a pattern others have noted: TurboQuant’s theoretical guarantees are built around inner product distortion — which governs attention score computation (key side). Value vectors feed the output projection, not the attention scoring, and the distortion profile there is different. The paper is honest about this if you read carefully; some of the hype around it is not.
What Hardware Do You Actually Need?
| Implementation | Hardware | Status |
|---|---|---|
| llama.cpp CPU (Aaryan-Kapoor) | Any modern CPU | Working, both K+V |
| llama.cpp Metal (TheTom) | Apple Silicon (M-series) | Working, extras included |
| vLLM + Triton (0xSero) | CUDA (RTX 3090 / RTX 5090, see repo) | Working proof of concept |
| MLX (flovflo) | Apple Silicon | Working, not paper-faithful |
| PyPI turboquant package | Any CUDA/CPU, HuggingFace | 4-bit, pip-installable |
| Official Google code | — | Not yet released, no public timeline |
The PyPI turboquant package is the lowest-friction entry point for experimentation: three lines to wrap a HuggingFace model’s KV cache. On an RTX 4080 with a 7B model at 1.8K context, a community benchmark reported 40% faster inference at the point where FP16 began swapping to system RAM. At shorter contexts, the benefit is mostly memory reduction.
Model size floor: Community implementations consistently find that results degrade noticeably on models below roughly 7–8B parameters — the 0.6B–0.8B range is a known rough edge. This isn’t an explicit paper claim; it’s an empirical pattern across independent implementations. Don’t expect miracles at sub-7B.
TurboQuant vs. The Competition
| Method | Compression | Calibration needed | Bitwidth overhead | Theoretically bounded |
|---|---|---|---|---|
| KIVI (ICML 2024) | ~2.6× | No | Yes | No |
| TurboQuant | ~4.5–6× | No | No | Yes (2.7× of optimal) |
| KVTC (NVIDIA, ICLR 2026) | ~20× | Yes | Unclear | No |
| GEAR | ~2–3× | Yes | Yes | No |
KVTC (NVIDIA’s competing method debuting at the same ICLR 2026 conference) achieves dramatically higher raw compression at 20× — but requires per-model calibration, which reintroduces deployment friction. KIVI is also calibration-free, but TurboQuant’s elimination of per-block overhead and its inner-product bias correction are what push it ahead in practice. For production systems serving many different models, the operational cost of calibration is real even when it’s a one-time step.
The “data-oblivious” property is not just a theoretical nicety. It means TurboQuant works the first time, on any model, without any warmup data. That’s a meaningful practical advantage for inference infrastructure teams.
The Things the Hype Gets Wrong
Let me be direct about where the breathless coverage overreaches.
“Zero accuracy loss” is specifically measured on certain benchmarks with models ≤ 8B. The NIAH and LongBench results are legitimately impressive. But community audits on value-dependent tasks and hidden-state transport show more nuance, especially on hybrid architectures like Qwen3.5.
“8× speedup” is a kernel-level attention logit computation benchmark on H100 with Google’s JAX implementation. End-to-end inference throughput improvements in community implementations range from modest (−3.9% prefill on CPU) to substantial (+32% prompt throughput on Apple Silicon) depending heavily on implementation quality, hardware, and context length. There is no single number here.
The values problem is real. The paper’s theoretical guarantees and strongest empirical results cluster around the key side — inner product preservation for attention scoring. Value compression is trickier and less well-covered. This is why the most careful implementations (like 0xSero’s) use TurboQuant-style compression for 3-bit keys but simpler group quantization for 2-bit values.
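For contrast with the rotation-based key path, here is what a plain per-group affine quantizer (the simpler value-side scheme) can look like. The 2-bit width matches the 0xSero setup above; the group size of 32 is my assumption:

```python
import numpy as np

def group_quant(v, bits=2, group=32):
    """Per-group affine quantization: each group stores a min and a scale,
    and every element becomes a small integer code on that group's grid."""
    v = v.reshape(-1, group)
    lo, hi = v.min(axis=1, keepdims=True), v.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    safe = np.where(scale == 0, 1, scale)  # guard against constant groups
    codes = np.round((v - lo) / safe).astype(np.uint8)
    return codes, lo, scale

def group_dequant(codes, lo, scale):
    return (codes * scale + lo).reshape(-1)

rng = np.random.default_rng(3)
v = rng.standard_normal(128)           # stands in for a value vector
codes, lo, scale = group_quant(v)
v_hat = group_dequant(codes, lo, scale)
```

No rotation, no residual sketch: the error is bounded by half a grid step per group, which is cruder than TurboQuant's guarantee but cheap, and apparently good enough on the value side in practice.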
When Should You Use It?
Use TurboQuant if:
– You’re running models ≥ 8B at longer context lengths (16K+)
– You’re memory-constrained and need to fit more concurrent users
– You’re deploying on Apple Silicon (Metal implementations are mature)
– You need a drop-in, calibration-free solution
– You’re working with standard transformer decoder architectures (Llama, Mistral, Gemma)
Be cautious if:
– Your model is hybrid-attention (Qwen3.5’s linear-attention and Mamba-style layers get no benefit, so end-to-end savings are capped)
– You need sub-8B model quality guarantees
– You need a production-grade, officially supported implementation (Google has not released code as of late March 2026, and no public timeline exists)
– Your workload is primarily short-context — you’ll see memory savings but limited throughput gains
The Bigger Picture
There’s a useful perspective from the broader compression research community worth acknowledging: TurboQuant is close to an information-theoretic limit. Its 2.7× constant factor from the Shannon lower bound means you’re near the ceiling of what compression alone can achieve for KV caches. Future progress will likely require architectural changes — different attention mechanisms, hierarchical caching, or abandoning standard attention altogether — rather than incrementally better quantization.
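To make that concrete, the information-theoretic floor in question is the Gaussian rate-distortion function, a standard result; this is a sketch of what "within 2.7×" means, not the paper's exact theorem statement:

```latex
% Floor: a variance-\sigma^2 Gaussian source at b bits per coordinate
D^{*}(b) = \sigma^2 \, 2^{-2b}
% The constant-factor guarantee: for every bitwidth b,
D_{\mathrm{TurboQuant}}(b) \;\le\; 2.7 \cdot D^{*}(b) \;=\; 2.7 \cdot \sigma^2 \, 2^{-2b}
```

Because the gap to the floor is a fixed constant at every bitwidth, better quantizers can shave at most that factor; they cannot change the exponential 2^(−2b) trade-off itself.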
That’s not a knock on TurboQuant. Being at the practical frontier of a compression regime and having a clean theoretical proof and working without calibration is genuinely rare. The paper from Zandieh and Mirrokni is the kind of work that should sit on the shelf of anyone building inference infrastructure — not because it’s the last word, but because it establishes where the wall is.
The community implementation momentum is also real. From Triton kernels to Metal shaders to a pip-installable PyPI package, the algorithm has been validated by independent developers working from the paper alone, without Google’s code. That’s a good sign for a paper’s reproducibility.
Google has not yet released official TurboQuant code, and no firm public timeline has been announced. When it eventually lands in vLLM and llama.cpp proper — follow this discussion thread for progress — TurboQuant will likely become the default KV cache compression strategy for long-context inference, the same way FlashAttention became the default attention implementation. Until then, the community implementations are functional enough to experiment with, and the theoretical foundation is solid enough to trust.