TurboQuant Guide

Near-Optimal KV Cache Compression — How it works

Abstract

TurboQuant is a vector quantization algorithm from Google Research & DeepMind (ICLR 2026) that achieves distortion rates close to the information-theoretic (Shannon) limit for KV cache compression. It combines a random rotation for decorrelation, optimal scalar quantization (PolarQuant), and a lightweight 1-bit QJL residual correction to compress high-dimensional KV vectors to 3–3.5 bits per coordinate, with no training and minimal quality loss.

Why KV Cache?

In autoregressive LLM inference, the KV cache stores the attention keys and values for all previous tokens. For long contexts (32K–128K tokens), this cache grows to gigabytes and can exceed the size of the model weights themselves.

During decoding, the bottleneck is memory bandwidth rather than compute (TFLOPS): generating each new token requires reading the entire KV cache from GPU memory. Quantizing the cache reduces the bytes moved per token and lets longer contexts fit on the same hardware.
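To make the scale concrete, the following back-of-the-envelope sketch estimates the cache size for one sequence. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen only for illustration, and the estimate ignores any per-vector metadata a real quantizer would store.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to hold the keys and values of one sequence."""
    # Factor 2 accounts for storing both keys and values.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_values * bits_per_value / 8

# Hypothetical model shape, not taken from any specific model card.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131072)

fp16 = kv_cache_bytes(bits_per_value=16, **cfg)
q35 = kv_cache_bytes(bits_per_value=3.5, **cfg)

print(f"FP16: {fp16 / 2**30:.1f} GiB, 3.5-bit: {q35 / 2**30:.1f} GiB")
# prints "FP16: 16.0 GiB, 3.5-bit: 3.5 GiB"
```

Under these assumptions, every decoded token reads 16 GiB of cache at FP16 versus 3.5 GiB at 3.5 bits, which is where the bandwidth saving comes from.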

The 3-Stage Pipeline

Random Rotation Decorrelation

A data-oblivious random orthogonal matrix Π transforms the KV vector so that its coordinates become approximately independent, each following a known Beta-like distribution.

y = Πx
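As a minimal sketch of this stage, the code below builds a random orthogonal matrix by orthonormalizing Gaussian vectors with modified Gram-Schmidt and applies it to a deliberately "spiky" unit vector. The helper names `random_orthogonal` and `rotate` are hypothetical, and pure Python is used only to keep the example self-contained; a practical implementation would likely use a GPU kernel or a fast structured transform.

```python
import math
import random

def random_orthogonal(d, rng):
    """Rows of a random orthogonal matrix via modified Gram-Schmidt."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            # Remove the component of v along each row found so far.
            proj = sum(a * b for a, b in zip(u, v))
            v = [b - proj * a for a, b in zip(u, v)]
        norm = math.sqrt(sum(b * b for b in v))
        if norm > 1e-9:  # resample in the unlikely degenerate case
            rows.append([b / norm for b in v])
    return rows

def rotate(pi, x):
    """Compute y = Πx for a matrix given as a list of rows."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in pi]

rng = random.Random(0)
d = 64
pi = random_orthogonal(d, rng)       # drawn once, independent of the data

x = [1.0] + [0.0] * (d - 1)          # all energy in a single coordinate
y = rotate(pi, x)

norm_y = math.sqrt(sum(v * v for v in y))
peak_y = max(abs(v) for v in y)
print(f"norm after rotation: {norm_y:.6f}, largest |coordinate|: {peak_y:.3f}")
```

The rotation preserves the norm while spreading the energy across all coordinates, so no single coordinate dominates and one shared scalar codebook can serve every coordinate.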

PolarQuant Scalar Quantization

Each rotated coordinate is quantized independently to b bits using a precomputed Lloyd-Max codebook matched to the known coordinate distribution. This achieves near-optimal 1D distortion without any joint optimization across coordinates.

D_MSE ≤ (√3·π/2) · 4^(-b)
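The 4^(-b) scaling can be checked empirically with a small Lloyd-Max sketch. This is an illustration under stated assumptions, not the paper's codebook: it fits the levels to samples from a standard normal, used here as a stand-in for the rotated-coordinate distribution, and the helper names `lloyd_max` and `quantize` are hypothetical.

```python
import bisect
import random

def lloyd_max(samples, bits, iters=100):
    """Fit a 2**bits-level scalar codebook by alternating nearest-level
    assignment and centroid updates (1-D Lloyd iteration)."""
    xs = sorted(samples)
    k = 2 ** bits
    # Start from evenly spaced sample quantiles.
    levels = [xs[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # Decision boundaries are midpoints between neighboring levels.
        bounds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        cuts = [0] + [bisect.bisect_left(xs, t) for t in bounds] + [len(xs)]
        # Each level moves to the mean of the samples in its cell.
        levels = [sum(xs[lo:hi]) / (hi - lo) if hi > lo else levels[i]
                  for i, (lo, hi) in enumerate(zip(cuts, cuts[1:]))]
    return levels

def quantize(x, levels):
    """Index of the level nearest to x."""
    return min(range(len(levels)), key=lambda i: abs(x - levels[i]))

rng = random.Random(0)
# Assumption: unit-variance Gaussian samples approximate the target distribution.
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

mse_by_bits = {}
for b in (1, 2, 3):
    levels = lloyd_max(samples, b)
    err = sum((x - levels[quantize(x, levels)]) ** 2 for x in samples)
    mse_by_bits[b] = err / len(samples)
    print(f"b={b}: MSE={mse_by_bits[b]:.4f}")
```

For a unit-variance Gaussian the results should land near the classical Lloyd-Max distortions of roughly 0.36, 0.12, and 0.035 for 1, 2, and 3 bits, with the per-bit reduction approaching a factor of 4 as b grows.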

1-bit QJL Residual Correction

A 1-bit quantized Johnson-Lindenstrauss (QJL) transform stores only the signs of a random projection of the quantization residual, which makes inner-product estimates unbiased. Only m ≪ d bits are needed per vector, so the overhead is negligible.

E[⟨q, x̂⟩] = ⟨q, x⟩ for any query vector q
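The unbiasedness property can be illustrated with a small simulation. The sketch below is one plausible reading of the idea, not the reference implementation: it assumes a Gaussian projection matrix S shared by encoder and decoder, assumes the residual norm is stored next to the sign bits, and uses coarse rounding as a crude stand-in for the stage-2 quantizer. The query vector is called `q` in the code, and the function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def qjl_encode(r, S):
    """Keep one sign bit per projection row, plus the residual norm."""
    signs = [1.0 if dot(row, r) >= 0 else -1.0 for row in S]
    return signs, math.sqrt(dot(r, r))

def qjl_inner(q, signs, r_norm, S):
    """Unbiased estimate of <q, r> recovered from the sign bits."""
    acc = sum(dot(row, q) * s for row, s in zip(S, signs))
    # sqrt(pi/2) undoes the shrinkage introduced by taking signs.
    return math.sqrt(math.pi / 2) * r_norm * acc / len(S)

rng = random.Random(0)
d, m, trials = 16, 16, 2000
x = [rng.gauss(0, 1) / math.sqrt(d) for _ in range(d)]
q = [rng.gauss(0, 1) / math.sqrt(d) for _ in range(d)]

x_hat = [round(v * 2) / 2 for v in x]   # crude stand-in for stage-2 quantization
r = [a - b for a, b in zip(x, x_hat)]   # residual left for QJL to correct

estimates = []
for _ in range(trials):
    # Fresh projection per trial, to average over the randomness in S.
    S = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
    signs, r_norm = qjl_encode(r, S)
    estimates.append(dot(q, x_hat) + qjl_inner(q, signs, r_norm, S))

mean_est = sum(estimates) / trials
true_ip = dot(q, x)
print(f"true: {true_ip:.4f}  uncorrected: {dot(q, x_hat):.4f}  corrected mean: {mean_est:.4f}")
```

A single estimate is noisy, but its average over random projections converges to the true inner product, whereas the uncorrected value ⟨q, x̂⟩ stays biased no matter how many times it is evaluated.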

Key Results

3.5 bits: >99% attention fidelity, PPL increase <1%
3 bits: 6x VRAM reduction, 8x attention speedup on H100
Zero training: Data-oblivious — works with any pre-trained model
Universal: Compatible with vLLM PagedAttention

References

→ TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate (arXiv)
→ Google Research Blog: TurboQuant — Redefining AI Efficiency
→ 0xSero/turboquant — vLLM Triton Implementation