TurboQuant Explained
A technical guide for ML engineers — based on arXiv:2504.19874 (ICLR 2026)
The KV Cache Problem
In autoregressive LLM inference, every generated token attends to all previous tokens via the attention mechanism. To avoid recomputing these attention keys and values at every step, they are cached — this is the KV cache.
For a model like Llama 3 70B (80 layers, 8 KV heads, head dimension 128) running at 32K context with a batch of 8, the FP16 KV cache alone occupies about 80 GiB.
The bottleneck isn't compute — modern GPUs have plenty of TFLOPS. The bottleneck is memory bandwidth. Every new token requires reading the entire KV cache from HBM through the memory bus. For a 128K context, this means transferring hundreds of gigabytes per second just for attention.
Quantizing the KV cache to 3–4 bits reduces this bandwidth by 4–5× and directly accelerates token generation speed — not just memory usage.
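As a rough illustration of the sizes involved, the cache footprint for the Llama 3 70B example can be estimated with a few lines of Python. The layer and KV-head counts come from the Model Presets table; the head dimension of 128 and the helper name `kv_cache_bytes` are assumptions made for this sketch, and the estimate ignores any per-vector metadata a quantizer would also store.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, batch, bits=16):
    """Bytes needed to cache keys and values for `batch` sequences of `tokens` tokens."""
    # The factor 2 accounts for storing both keys and values.
    elements = 2 * layers * kv_heads * head_dim * tokens * batch
    return elements * bits / 8

GIB = 1024 ** 3

# Llama 3 70B: 80 layers, 8 KV heads. head_dim=128 is an assumption of this sketch.
fp16 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=32 * 1024, batch=8)
q3 = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=32 * 1024, batch=8, bits=3)

print(f"FP16:  {fp16 / GIB:.1f} GiB")  # FP16:  80.0 GiB
print(f"3-bit: {q3 / GIB:.1f} GiB")    # 3-bit: 15.0 GiB
```

Because every decoded token reads the whole cache, the same ratio applies to the bytes moved per step, not only to the bytes stored.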
KV Cache Growth vs Context
KV cache memory grows linearly with context length. TurboQuant at 3-bit reduces memory by ~6x compared to FP16 across all context sizes.
Why Existing Methods Fall Short
Existing KV cache quantization methods face two fundamental challenges:
| Method | Online? | Training-free? | Inner Product Unbiased? | Distortion Rate |
|---|---|---|---|---|
| GPTQ | ✗ | ✗ | ✗ | O(4^-b) |
| AWQ | ✗ | ✗ | ✗ | O(4^-b) |
| KIVI | ✓ | ✓ | ✗ | Sub-optimal |
| KVQuant | ✗ | ✗ | ✗ | O(4^-b) |
| Product Quant | ✗ | ✗ | ✗ | Sub-optimal |
| TurboQuant | ✓ | ✓ | ✓ | Near-Shannon ✓ |
The two key requirements for practical KV cache quantization are: (1) online operation — vectors must be quantized token-by-token as they arrive, with no offline calibration; and (2) unbiased inner products — quantization error must not systematically bias attention scores, or model quality degrades. No prior method satisfies both simultaneously at near-optimal distortion rates.
TurboQuant: Core Idea
TurboQuant's key insight: high-dimensional vector quantization is hard, but 1D scalar quantization is solved. If you can transform a correlated high-dimensional vector into independent scalar components, you can apply optimal 1D quantizers to each dimension independently.
Step 1: Random Rotation (PolarQuant)
Raw KV vectors have strongly correlated coordinates — some dimensions dominate, others carry little information. Quantizing such vectors directly wastes bits on redundant structure.
TurboQuant multiplies every KV vector by a pre-generated random orthogonal matrix Π:
y = Πx
After rotation, a unit-norm vector y is uniformly distributed on the sphere, so each coordinate of y follows a Beta distribution with known parameters, and in high dimensions the coordinates are approximately independent. This transforms the problem from correlated d-dimensional quantization into d independent 1D quantization problems.
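The effect of the rotation can be seen in a small NumPy sketch. The construction below (QR decomposition of a Gaussian matrix) is one common way to sample a random orthogonal matrix and is an assumption of this example, not necessarily the construction used by the paper; the function name mirrors the `random_orthogonal_matrix` helper in the reference implementation but is otherwise hypothetical.

```python
import numpy as np

def random_orthogonal_matrix(d, seed=0):
    """Sample a d x d orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Fix the column signs so the result is uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))

d = 128
Pi = random_orthogonal_matrix(d)

# A deliberately "spiky" unit vector: almost all energy in one coordinate.
x = np.zeros(d)
x[0], x[1] = 0.99, np.sqrt(1 - 0.99 ** 2)

y = Pi @ x
print(np.linalg.norm(x), np.linalg.norm(y))  # the rotation preserves the norm
print(np.abs(x).max(), np.abs(y).max())      # the energy is spread across coordinates
```

Since Π is orthogonal, inner products are preserved exactly, so attention scores can be computed in the rotated space as long as queries are rotated with the same matrix.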
Step 2: Scalar Quantization
After rotation, each coordinate yᵢ is independently quantized using a Lloyd-Max optimal codebook — the theoretically optimal scalar quantizer for the known Beta distribution.
D_MSE ≤ (√3·π/2) · 4^(-b)
This is within a small constant factor (about 2.7) of the information-theoretic lower bound of 4^(-b), so the distortion rate is near-optimal. The codebook is pre-computed offline (one-time cost) and reused for all vectors.
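A codebook of this kind can be approximated numerically with Lloyd's algorithm. The sketch below fits it to standard Gaussian samples, which is an assumption: for large d the Beta distribution of a rotated coordinate is close to a Gaussian with variance 1/d, so the samples stand in for √d · yᵢ. The function names are hypothetical.

```python
import numpy as np

def quantize(values, levels):
    """Return the index of the nearest codebook level for each value."""
    boundaries = (levels[:-1] + levels[1:]) / 2  # midpoints between sorted levels
    return np.searchsorted(boundaries, values)

def lloyd_max_codebook(samples, bits, iters=100):
    """Fit a 2**bits-level scalar codebook to 1D samples with Lloyd's algorithm."""
    n = 2 ** bits
    # Start from evenly spaced quantiles of the data.
    levels = np.quantile(samples, (np.arange(n) + 0.5) / n)
    for _ in range(iters):
        idx = quantize(samples, levels)
        counts = np.bincount(idx, minlength=n)
        sums = np.bincount(idx, weights=samples, minlength=n)
        # Move each level to the mean of its cell; keep empty cells unchanged.
        levels = np.where(counts > 0, sums / np.maximum(counts, 1), levels)
    return levels

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)  # stand-in for sqrt(d) * y_i

for bits in (1, 2, 3, 4):
    levels = lloyd_max_codebook(samples, bits)
    mse = float(np.mean((samples - levels[quantize(samples, levels)]) ** 2))
    print(f"{bits}-bit  MSE={mse:.4f}  MSE * 4^b={mse * 4 ** bits:.2f}")
```

With unit-variance samples the per-coordinate MSE equals the total MSE of a unit-norm vector, so the printed values should land near the MSE column of the table below, and the slowly varying `MSE * 4^b` column illustrates the 4^(-b) rate.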
| Bit Width | MSE | Cosine Similarity | PPL vs FP16 |
|---|---|---|---|
| FP16 | 0.000 | 1.0000 | baseline |
| 4-bit | 0.010 | 0.9952 | +0.3% |
| 3.5-bit | 0.021 | 0.9896 | <1% |
| 3-bit | 0.035 | 0.9826 | <1% |
| 2.5-bit | 0.067 | 0.9659 | ~2.9% |
| 2-bit | 0.118 | 0.9396 | Significant |
Source: arXiv:2504.19874, Table 1. Llama 3 70B, 128K context NIAH evaluation.
Step 3: Inner Product Correction (QJL)
PolarQuant minimizes MSE, but an MSE-optimal quantizer introduces a systematic bias into inner product estimation. For attention, E[⟨q, k̂⟩] ≠ ⟨q, k⟩ in general. Accumulated over millions of tokens, this bias degrades model quality.
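The bias is easiest to see at 1 bit. The sketch below is an illustration under stated assumptions: the coordinates are treated as Gaussian with variance 1/d (so the MSE-optimal 1-bit levels are ±√(2/(πd))), and the key itself is used as the query for simplicity. Under those assumptions the quantized inner product shrinks by a factor close to 2/π.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096

# A random unit vector; its coordinates behave like those of a rotated KV vector.
x = rng.standard_normal(d)
x /= np.linalg.norm(x)

# MSE-optimal 1-bit codebook for N(0, 1/d) coordinates: +/- sqrt(2 / (pi * d)).
level = np.sqrt(2 / (np.pi * d))
x_hat = np.sign(x) * level

# Compare the quantized inner product with the exact one (which is 1 here).
ratio = float(x @ x_hat) / float(x @ x)
print(f"<x, x_hat> / <x, x> = {ratio:.3f}  (2/pi = {2 / np.pi:.3f})")
```

A multiplicative shrinkage like this does not average out across tokens, which is why a separate correction step is needed rather than more samples.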
QJL (Quantized Johnson-Lindenstrauss) adds a 1-bit correction sketch:
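r = x − x̂⁽¹⁾
s = sign(Wr)
E[⟨y, x̂⟩] = ⟨y, x⟩
Here x̂⁽¹⁾ is the Step 2 reconstruction, r is the residual it leaves behind, W is a random projection matrix with m rows, y is the query, and x̂ is the corrected reconstruction that combines x̂⁽¹⁾ with the sketch. The sketch s stores only m sign bits per vector (m ≪ d, typically 32–64 bits regardless of vector dimension). The total overhead is negligible: less than 0.5 bits per element in practice.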
Increasing m reduces the variance of the estimate in proportion to 1/m. For m=64, variance is negligible for typical KV vectors.
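A sign sketch of this kind can be turned into an unbiased inner product estimate as in the sketch below. It assumes W has i.i.d. standard Gaussian entries and that the residual norm is stored alongside the sign bits; both are assumptions of this example, and the function names are hypothetical. The estimator relies on the identity E[(w·y) sign(w·r)] = √(2/π) ⟨y, r⟩ / ‖r‖ for a Gaussian vector w.

```python
import numpy as np

def qjl_encode(r, W):
    """Keep only the signs of the projected residual, plus its norm."""
    signs = np.where(W @ r >= 0, 1, -1).astype(np.int8)
    return signs, float(np.linalg.norm(r))

def qjl_inner_product(y, signs, r_norm, W):
    """Unbiased estimate of <y, r> from the sign sketch."""
    m = W.shape[0]
    return float(np.sqrt(np.pi / 2) * r_norm / m * ((W @ y) @ signs))

rng = np.random.default_rng(0)
d = 128
r = 0.1 * rng.standard_normal(d)  # stand-in for a quantization residual
y = rng.standard_normal(d)        # stand-in for a query vector

for m in (64, 256, 1024):
    errors = []
    for _ in range(200):
        W = rng.standard_normal((m, d))  # fresh projection per trial
        signs, r_norm = qjl_encode(r, W)
        errors.append(qjl_inner_product(y, signs, r_norm, W) - float(y @ r))
    print(f"m={m:5d}  mean error={np.mean(errors):+.4f}  std={np.std(errors):.4f}")
```

The mean error should stay near zero for every m, while the spread shrinks as m grows. Adding this estimate of ⟨y, r⟩ to ⟨y, x̂⁽¹⁾⟩ gives the corrected attention score.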
Performance Results
Bit Width vs Precision Tradeoff
TurboQuant achieves near-Shannon optimal distortion, maintaining high cosine similarity even at very low bit widths. At 3-bit, cosine similarity remains above 0.98.
| Method | Bits | KV VRAM (70B, 8K) | PPL (WikiText) | Attn Speed (H100) |
|---|---|---|---|---|
| FP16 | 16 | 19.0 GB | 2.76 | 1× |
| GPTQ-4bit | 4 | 5.2 GB | 2.84 | 1.5× |
| TurboQuant 3.5b | 3.5 | 3.6 GB | 2.78 | 5× |
| TurboQuant 3b | 3 | 3.2 GB | 2.84 | 8× |
Source: arXiv:2504.19874. Llama 3 70B, 8K context window, H100 GPU.
Reference Implementation
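    def turboquant_encode(x, bits):
        # The helper functions are placeholders for the three steps described above.
        # Step 1: Random rotation (in practice R is generated once and reused)
        R = random_orthogonal_matrix(x.shape[-1])
        x_rotated = x @ R.T
        # Step 2: Scalar quantization (MSE-optimal)
        x_quantized = scalar_quantize(x_rotated, bits=bits)
        # Step 3: QJL correction for inner products, sketched from the residual
        # (scalar_dequantize is a hypothetical inverse of scalar_quantize)
        residual = x_rotated - scalar_dequantize(x_quantized, bits=bits)
        sign_sketch = qjl_sketch(residual)
        return x_quantized, sign_sketch, R
Use Cases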
LLM Inference Serving
Reduce KV cache memory in vLLM / SGLang to serve more concurrent requests or longer contexts on the same GPU fleet. Native integration with PagedAttention.
Long-Context on Consumer GPUs
Run Llama 3 70B with 64K+ context on a single RTX 4090 (24GB). Previously required 4× A100 80GB. TurboQuant 3-bit reduces KV cache from 38GB to ~6GB.
Vector Database Search
Compress embedding vectors for FAISS / Milvus indices while preserving inner product accuracy. TurboQuant's unbiasedness guarantee is critical for ranking quality.
Mobile / Edge Inference
Reduce memory footprint for on-device LLMs. The online, training-free nature means quantization happens at runtime — no model modification needed.
Quick Reference
Key Formulas
Model Presets
| Model | Hidden | Layers | KV Heads |
|---|---|---|---|
| Llama 3 8B | 4096 | 32 | 8 |
| Llama 3 70B | 8192 | 80 | 8 |
| Gemma 3 4B | 2560 | 34 | 4 |
| Gemma 3 27B | 5376 | 62 | 16 |
| Mistral 7B | 4096 | 32 | 8 |