Near-Optimal KV Cache Compression — How it works
TurboQuant is a novel vector quantization algorithm proposed by Google Research & DeepMind (ICLR 2026) that achieves near-Shannon-limit distortion rates for KV cache compression. By combining random rotation decorrelation with optimal scalar quantization (PolarQuant) and a lightweight 1-bit QJL residual correction, it compresses high-dimensional KV vectors to 3–3.5 bits with zero training and minimal quality loss.
In autoregressive LLM inference, the KV cache stores attention keys and values for all previous tokens. For long contexts (32K–128K), this cache grows to gigabytes — often larger than the model weights themselves.
The bottleneck isn't compute (TFLOPS) — it's memory bandwidth. Every new token requires reading the entire KV cache through the memory bus. Quantizing the cache reduces bandwidth and enables longer contexts on the same hardware.
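To make the scale concrete, here is a back-of-the-envelope calculation. The model configuration below is illustrative (not taken from the article): a 32-layer model with 8 KV heads of dimension 128 at a 128K context.

```python
# Hypothetical model config (illustrative numbers, not from the article):
n_layers, n_kv_heads, head_dim = 32, 8, 128
seq_len = 131072                    # 128K-token context
bytes_fp16 = 2                      # bytes per value in fp16

# Factor of 2 accounts for storing both keys and values.
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_fp16
print(f"fp16 KV cache: {kv_bytes / 2**30:.1f} GiB")    # 16.0 GiB

bits = 3.5                          # TurboQuant's per-value budget
compressed = kv_bytes * bits / 16   # scale 16-bit values down to 3.5 bits
print(f"3.5-bit cache: {compressed / 2**30:.1f} GiB")  # 3.5 GiB
```

At fp16 this cache alone is 16 GiB; at 3.5 bits it drops to 3.5 GiB, and every decoded token reads proportionally less through the memory bus.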
A data-oblivious random orthogonal matrix Π transforms the KV vector so that its coordinates become approximately independent, each following a known Beta-like distribution.
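A minimal sketch of the rotation step, using a random orthogonal matrix drawn via QR decomposition of a Gaussian matrix (one common construction; the actual implementation may use a faster structured transform). The dimension is an assumed illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                                # head dimension (illustrative value)

# Data-oblivious rotation: QR of a Gaussian matrix yields a random
# orthogonal matrix Pi, independent of the data being quantized.
Pi, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                # unit-norm KV vector
y = Pi @ x                            # rotated coordinates

# Rotation is an isometry: norms and inner products are preserved exactly,
# while the rotated coordinates become approximately i.i.d.
print(np.linalg.norm(y))              # 1.0 up to floating point
```

Because Π is orthogonal, quantization error measured on the rotated vector equals the error on the original vector, so all the distortion analysis can be done coordinate-wise in the rotated basis.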
y = Πx

Each rotated coordinate is independently quantized using a pre-computed Lloyd-Max optimal codebook. This achieves near-optimal 1D distortion without any joint optimization.
D_MSE ≤ (3π²/4) · 4^(-b)

A Johnson-Lindenstrauss projection of the quantization residual ensures unbiased inner products. Only m ≪ d bits are needed per vector — the overhead is negligible.
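The scalar step can be sketched as follows. This fits a Lloyd-Max codebook by running Lloyd's algorithm on samples (the paper precomputes codebooks for the known post-rotation coordinate distribution; standard normal samples are used here as a stand-in, relying on scale invariance, since coordinates of a rotated unit-norm vector are close to N(0, 1/d)). The bit width is an assumed illustrative value, and the measured distortion is compared against the bound quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
b = 3                                  # bits per coordinate (illustrative)
levels = 2 ** b

# Fit a Lloyd-Max codebook offline on samples of the coordinate distribution.
samples = rng.standard_normal(200_000)
codebook = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
for _ in range(100):                   # Lloyd iterations: partition, re-center
    edges = (codebook[:-1] + codebook[1:]) / 2
    idx = np.searchsorted(edges, samples)
    codebook = np.array([samples[idx == j].mean() for j in range(levels)])

# Quantize fresh samples to the nearest centroid and measure the relative
# distortion E[(x - x_hat)^2] / Var[x]. For a unit-norm vector whose d
# coordinates each have variance 1/d, this equals the per-vector MSE.
fresh = rng.standard_normal(100_000)
edges = (codebook[:-1] + codebook[1:]) / 2
x_hat = codebook[np.searchsorted(edges, fresh)]
distortion = np.mean((fresh - x_hat) ** 2) / np.var(fresh)

bound = (3 * np.pi**2 / 4) * 4.0 ** (-b)   # the D_MSE bound quoted above
print(distortion, bound)               # empirical distortion sits under the bound
```

For b = 3 the bound evaluates to roughly 0.116, while the empirical Lloyd-Max distortion on Gaussian coordinates is around 0.035, comfortably inside it.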
E[⟨y, x̂⟩] = ⟨y, x⟩
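The unbiasedness property can be checked empirically with a sign-based (1-bit) Johnson-Lindenstrauss sketch in the style of QJL. The sketch below is illustrative, with assumed dimensions; the sketch size m is made very large only so the empirical mean visibly converges. It uses the identity that for Gaussian s, E[⟨s, q⟩ · sign(⟨s, k⟩)] = √(2/π) · ⟨q, k⟩ / ‖k‖, which makes the estimator below unbiased for the inner product.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 16, 50_000                  # vector dim, sketch size (large for the demo)

k = rng.standard_normal(d)         # vector to be sketched (e.g. a residual)
q = rng.standard_normal(d)         # query vector

S = rng.standard_normal((m, d))    # Gaussian JL projection
signs = np.sign(S @ k)             # 1-bit sketch: keep only signs and ||k||

# Unbiased inner-product estimator from the 1-bit sketch:
# sqrt(pi/2) * ||k|| * mean(<s_i, q> * sign(<s_i, k>))  ->  <q, k>
est = np.sqrt(np.pi / 2) * np.linalg.norm(k) * np.mean((S @ q) * signs)
true = q @ k
print(est, true)                   # the estimate concentrates around <q, k>
```

Only the sign bits and a single norm are stored per vector, which is where the "m ≪ d bits plus negligible overhead" accounting comes from.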