TurboQuant Explained

A technical guide for ML engineers — based on arXiv:2504.19874 (ICLR 2026)

The KV Cache Problem

In autoregressive LLM inference, every generated token attends to all previous tokens via the attention mechanism. To avoid recomputing the keys and values of those previous tokens at every step, they are cached; this is the KV cache.

For a model like Llama 3 70B running at 32K context with a batch of 8:

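| Quantity | Size | Notes |
|---|---|---|
| FP16 KV Cache | ~86 GB | Llama 3 70B, 32K ctx, batch=8 (2 × 80 layers × 8 KV heads × 128 head dim × 32K × 8 × 2 bytes) |
| Model Weights | ~140 GB | FP16 weights alone |
| Total VRAM | ~226 GB | At least 3× H100 80GB, before activations and runtime overhead |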

The bottleneck isn't compute: modern GPUs have plenty of TFLOPS. The bottleneck is memory bandwidth. Every new token requires reading the entire KV cache from HBM through the memory bus. For a 128K context, this means streaming tens of gigabytes of keys and values for every generated token of every sequence, just for attention.

Quantizing the KV cache to 3–4 bits reduces this bandwidth by 4–5× and directly accelerates token generation speed — not just memory usage.
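
To make the bandwidth argument concrete, here is a minimal sketch that estimates how long one decode step spends just streaming the KV cache of a single sequence. The Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) and the 3.35 TB/s peak HBM figure for an H100 are assumptions for illustration, and `kv_bytes_per_sequence` and `min_decode_ms` are hypothetical helpers. Real kernels do not reach peak bandwidth, so treat the times as lower bounds.

```python
def kv_bytes_per_sequence(layers, kv_heads, head_dim, context, bits):
    # Keys and values (factor 2) for every layer, KV head and cached token.
    return 2 * layers * kv_heads * head_dim * context * bits / 8

def min_decode_ms(kv_bytes, hbm_bytes_per_s):
    # The whole cache is read once per generated token, so this is a floor
    # on per-token latency from KV traffic alone.
    return 1e3 * kv_bytes / hbm_bytes_per_s

H100_HBM = 3.35e12  # bytes per second, assumed peak bandwidth

for bits in (16, 4, 3):
    size = kv_bytes_per_sequence(80, 8, 128, 128 * 1024, bits)
    ms = min_decode_ms(size, H100_HBM)
    print(f"{bits:>2}-bit KV cache: {size / 1e9:5.1f} GB, at least {ms:4.1f} ms per token")
```

Because the read time scales linearly with the number of bits stored, any reduction in cache size translates directly into a lower latency floor per token.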

KV Cache Growth vs Context

KV cache memory grows linearly with context length. TurboQuant at 3-bit reduces memory by roughly 5× compared to FP16 (16/3 ≈ 5.3×) across all context sizes.

Why Existing Methods Fall Short

Existing KV cache quantization methods face two fundamental challenges:

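| Method | Online? | Training-free? | Inner Product Unbiased? | Distortion Rate |
|---|---|---|---|---|
| GPTQ | | | | O(4^-b) |
| AWQ | | | | O(4^-b) |
| KIVI | | | | Sub-optimal |
| KVQuant | | | | O(4^-b) |
| Product Quant | | | | Sub-optimal |
| TurboQuant | | | | Near-Shannon ✓ |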

The two key requirements for practical KV cache quantization are: (1) online operation — vectors must be quantized token-by-token as they arrive, with no offline calibration; and (2) unbiased inner products — quantization error must not systematically bias attention scores, or model quality degrades. No prior method satisfies both simultaneously at near-optimal distortion rates.

TurboQuant: Core Idea

TurboQuant's key insight: high-dimensional vector quantization is hard, but 1D scalar quantization is solved. If you can transform a correlated high-dimensional vector into independent scalar components, you can apply optimal 1D quantizers to each dimension independently.

TurboQuant Pipeline
x ∈ ℝᵈ → Random Rotation → y = Πx → PolarQuant → x̂⁽¹⁾ → QJL Correction → x̂ (final)

Step 1: Random Rotation (PolarQuant)

Raw KV vectors have strongly correlated coordinates — some dimensions dominate, others carry little information. Quantizing such vectors directly wastes bits on redundant structure.

TurboQuant multiplies every KV vector by a pre-generated random orthogonal matrix Π:

y = Πx

After rotation, y/‖x‖ is uniformly distributed on the unit sphere, so each coordinate of y follows a scaled Beta distribution with known parameters (close to a Gaussian with variance ‖x‖²/d for large d), and distinct coordinates are nearly independent in high dimensions. This transforms the problem from correlated d-dimensional quantization into d approximately independent 1D quantization problems.

Why random and not learned? A random matrix is data-oblivious: it works for any input without calibration. Crucially, the same matrix Π is used for both keys and queries, so inner products are preserved: ⟨Πq, Πk⟩ = ⟨q, k⟩. No training needed.
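
The rotation step can be sketched in a few lines of NumPy. This is an illustrative construction (QR decomposition of a Gaussian matrix), not necessarily how the paper generates Π; `random_orthogonal_matrix` and the "spiky" test vector are assumptions made up for the example.

```python
import numpy as np

def random_orthogonal_matrix(d, rng):
    # QR of a Gaussian matrix; multiplying each column by the sign of the
    # matching diagonal entry of R makes Q uniformly distributed over
    # orthogonal matrices.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
d = 128
Pi = random_orthogonal_matrix(d, rng)

# A deliberately spiky key: four coordinates carry almost all the energy.
k = rng.standard_normal(d) * np.where(np.arange(d) < 4, 20.0, 0.1)
q = rng.standard_normal(d)
k_rot, q_rot = Pi @ k, Pi @ q

print("orthogonal:", np.allclose(Pi.T @ Pi, np.eye(d)))
print("inner product preserved:", np.isclose(q_rot @ k_rot, q @ k))
print("largest coordinate / norm before:", np.abs(k).max() / np.linalg.norm(k))
print("largest coordinate / norm after: ", np.abs(k_rot).max() / np.linalg.norm(k_rot))
```

The last two lines show the point of the rotation: the energy that was concentrated in a few coordinates is spread across all of them, so a single scalar codebook fits every coordinate.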

Step 2: Scalar Quantization

After rotation, each coordinate yᵢ is independently quantized using a Lloyd-Max optimal codebook — the theoretically optimal scalar quantizer for the known Beta distribution.

D_MSE ≤ (√3·π/2) · 4^(-b)

This is within a constant factor of about 2.7 of the information-theoretic lower bound 4^(-b) that applies to any b-bit quantizer, so the distortion rate is near-optimal. The codebook is pre-computed offline (one-time cost) and reused for all vectors.
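
A minimal sketch of how such a codebook could be fitted with Lloyd iterations. It assumes the rotated coordinates are rescaled to unit variance and approximated by a standard normal, which stands in for the exact coordinate distribution; `lloyd_max_codebook` and `quantize` are hypothetical helpers, not the paper's code.

```python
import numpy as np

def lloyd_max_codebook(samples, bits, iters=100):
    """Fit 2**bits scalar levels to 1-D samples with Lloyd iterations."""
    n_levels = 2 ** bits
    # Start from evenly spaced quantiles of the data.
    levels = np.quantile(samples, (np.arange(n_levels) + 0.5) / n_levels)
    for _ in range(iters):
        # Nearest-level cells are split at midpoints between sorted levels.
        edges = (levels[:-1] + levels[1:]) / 2
        cells = np.searchsorted(edges, samples)
        counts = np.bincount(cells, minlength=n_levels)
        sums = np.bincount(cells, weights=samples, minlength=n_levels)
        # Move each level to the mean of its cell (keep it if the cell is empty).
        levels = np.where(counts > 0, sums / np.maximum(counts, 1), levels)
    return levels

def quantize(values, levels):
    # Map each value to its nearest level.
    edges = (levels[:-1] + levels[1:]) / 2
    return levels[np.searchsorted(edges, values)]

rng = np.random.default_rng(0)
train = rng.standard_normal(100_000)
test = rng.standard_normal(100_000)

for bits in (1, 2, 3, 4):
    levels = lloyd_max_codebook(train, bits)
    mse = np.mean((test - quantize(test, levels)) ** 2)
    print(f"{bits}-bit: MSE = {mse:.4f}, MSE * 4^b = {mse * 4**bits:.2f}")
```

For a unit-variance Gaussian the textbook Lloyd-Max distortions are roughly 0.36, 0.12, 0.035 and 0.0095 for 1 to 4 bits, so the printed values should land near those. The last column is the empirical constant in front of 4^(-b).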

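| Bit Width | MSE | Cosine Similarity | PPL vs FP16 |
|---|---|---|---|
| FP16 | 0.000 | 1.0000 | baseline |
| 4-bit | 0.010 | 0.9952 | +0.3% |
| 3.5-bit | 0.021 | 0.9896 | <1% |
| 3-bit | 0.035 | 0.9826 | <1% |
| 2.5-bit | 0.067 | 0.9659 | ~2.9% |
| 2-bit | 0.118 | 0.9396 | Significant |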

Source: arXiv:2504.19874, Table 1. Llama 3 70B, 128K context NIAH evaluation.

Step 3: Inner Product Correction (QJL)

PolarQuant minimizes MSE, but an MSE-optimal quantizer introduces a systematic bias into inner product estimation. For attention, E[⟨q, k̂⟩] ≠ ⟨q, k⟩. Accumulated over millions of tokens, this bias degrades model quality.

QJL (Quantized Johnson-Lindenstrauss) adds a 1-bit correction sketch:

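r = x − x̂⁽¹⁾
s = sign(Wr)
x̂ = x̂⁽¹⁾ + √(π/2) · (‖r‖/m) · Wᵀs
E[⟨y, x̂⟩] = ⟨y, x⟩

Here W ∈ ℝ^(m×d) has i.i.d. standard Gaussian entries, the expectation is taken over W, and y is any query vector (not the rotated vector from Step 1).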

The sketch s stores only m sign bits per vector (m ≪ d, typically 32–64 bits regardless of vector dimension). The total overhead is negligible — less than 0.5 bits per element in practice.

Variance bound: Var[⟨y, x̂⟩] ≤ (π/2) · ‖y‖² · ‖r‖² / m
The variance falls as 1/m, so doubling m halves it. Because ‖r‖² is already small after Step 2, m=64 keeps the added noise low for typical KV vectors.
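
The unbiasedness claim can be checked numerically. The sketch below makes several assumptions for illustration: a 1-bit sign quantizer stands in for the first stage, W has i.i.d. standard Gaussian entries, the residual norm ‖r‖ is stored next to the sign bits, and the residual is rescaled by √(π/2)·‖r‖/m. It averages over many draws of W only to expose the expectation; a real system would use one fixed W.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 32, 32, 5000

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                 # key vector
y = x.copy()                           # a query aligned with the key

# Stage-1 stand-in: sign quantizer with the MSE-optimal scale mean(|x|).
x1 = np.mean(np.abs(x)) * np.sign(x)
r = x - x1                             # residual left after stage 1

def qjl_estimate(y, x1, r, W):
    s = np.where(W @ r >= 0, 1.0, -1.0)                  # m stored sign bits
    scale = np.sqrt(np.pi / 2) * np.linalg.norm(r) / W.shape[0]
    return y @ x1 + scale * (W @ y) @ s

estimates = np.array([
    qjl_estimate(y, x1, r, rng.standard_normal((m, d))) for _ in range(trials)
])

print("true inner product:", y @ x)
print("stage 1 only      :", y @ x1)
print("with QJL sketch   :", estimates.mean(), "+/-", estimates.std() / np.sqrt(trials))
```

The stage-1 estimate is visibly biased, while the sketch-corrected mean matches the true value up to sampling noise.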

Performance Results

Bit Width vs Precision Tradeoff

TurboQuant achieves near-Shannon optimal distortion, maintaining high cosine similarity even at very low bit widths. At 3-bit, cosine similarity remains above 0.98.

| Method | Bits | KV VRAM (70B, 8K) | PPL (WikiText) | Attn Speed (H100) |
|---|---|---|---|---|
| FP16 | 16 | 19.0 GB | 2.76 | |
| GPTQ-4bit | 4 | 5.2 GB | 2.84 | 1.5× |
| TurboQuant 3.5b | 3.5 | 3.6 GB | 2.78 | |
| TurboQuant 3b | 3 | 3.2 GB | 2.84 | |

Source: arXiv:2504.19874. Llama 3 70B, 8K context window, H100 GPU.

Reference Implementation

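```python
import numpy as np

def random_orthogonal_matrix(d, seed=0):
    # QR of a Gaussian matrix gives a random orthogonal matrix; the fixed
    # seed lets keys and queries share the same rotation.
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q

def turboquant_encode(x, codebook, R, W):
    # codebook: precomputed 1-D levels (2**bits entries); W: (m, d) Gaussian sketch matrix
    # Step 1: Random rotation with the shared, pre-generated matrix R
    x_rotated = x @ R.T

    # Step 2: Scalar quantization (nearest codebook level per coordinate)
    codes = np.abs(x_rotated[..., None] - codebook).argmin(axis=-1)
    x_quantized = codebook[codes]

    # Step 3: QJL correction, a sign sketch of the quantization residual
    residual = x_rotated - x_quantized
    sign_sketch = np.where(residual @ W.T >= 0, 1, -1)

    return codes, sign_sketch, np.linalg.norm(residual, axis=-1)
```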
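
The encoder covers the write path. A self-contained sketch of the read path, under stated assumptions, might look as follows: the residual is taken in rotated space, its norm is stored with each vector, W is a shared Gaussian matrix, and a uniform 8-level codebook stands in for the Lloyd-Max codebook. The names `encode` and `estimate_score` are hypothetical.

```python
import numpy as np

def encode(k, R, W, codebook):
    k_rot = R @ k
    codes = np.abs(k_rot[:, None] - codebook).argmin(axis=1)   # nearest level
    residual = k_rot - codebook[codes]
    signs = np.where(W @ residual >= 0, 1.0, -1.0)
    return codes, signs, np.linalg.norm(residual)

def estimate_score(q, codes, signs, residual_norm, R, W, codebook):
    q_rot = R @ q                                   # same rotation as the keys
    coarse = q_rot @ codebook[codes]                # stage-1 inner product
    scale = np.sqrt(np.pi / 2) * residual_norm / W.shape[0]
    return coarse + scale * (W @ q_rot) @ signs     # residual correction

rng = np.random.default_rng(0)
d, m = 128, 64
R, _ = np.linalg.qr(rng.standard_normal((d, d)))    # random orthogonal matrix
W = rng.standard_normal((m, d))                     # shared sketch matrix
codebook = np.linspace(-0.25, 0.25, 8)              # 3-bit uniform stand-in

k = rng.standard_normal(d)
k /= np.linalg.norm(k)
q = rng.standard_normal(d)
q /= np.linalg.norm(q)

stored = encode(k, R, W, codebook)
print("exact   :", q @ k)
print("estimate:", estimate_score(q, *stored, R, W, codebook))
```

Only the integer codes, the m sign bits and one scalar norm are stored per key; the query is rotated once and reused against every cached key.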

Use Cases

LLM Inference Serving

Reduce KV cache memory in vLLM / SGLang to serve more concurrent requests or longer contexts on the same GPU fleet. Native integration with PagedAttention.

vLLM · SGLang · PagedAttention

Long-Context on Consumer GPUs

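Fit the KV cache for 64K+ context into the memory budget of a single consumer GPU such as an RTX 4090 (24GB). TurboQuant 3-bit reduces a 38GB FP16 KV cache to about 7GB (38 × 3/16 ≈ 7.1). The model weights are not affected by KV cache quantization, so a 70B model (~140 GB in FP16) still needs weight quantization or offloading to run on such a card; an unquantized FP16 deployment previously required 4× A100 80GB.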

RTX 4090 · 64K context · Single GPU

Vector Database Search

Compress embedding vectors for FAISS / Milvus indices while preserving inner product accuracy. TurboQuant's unbiasedness guarantee is critical for ranking quality.

FAISS · Milvus · RAG

Mobile / Edge Inference

Reduce memory footprint for on-device LLMs. The online, training-free nature means quantization happens at runtime — no model modification needed.

On-device · Edge · Mobile

Quick Reference

Key Formulas

KV Cache (bytes): 2 × layers × kv_heads × head_dim × context × batch × (bits/8)
Distortion: D_MSE ≤ (√3·π/2) · 4^(-b)
Rotation: y = Πx, Π ∈ O(d) random orthogonal
QJL: s = sign(Wr), W ∈ ℝ^(m×d)

Model Presets

| Model | Hidden | Layers | KV Heads |
|---|---|---|---|
| Llama 3 8B | 4096 | 32 | 8 |
| Llama 3 70B | 8192 | 80 | 8 |
| Gemma 3 4B | 2560 | 46 | 8 |
| Gemma 3 27B | 5120 | 62 | 16 |
| Mistral 7B | 4096 | 32 | 8 |
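
The KV cache formula and three of the presets above combine into a small calculator. The table does not list the head dimension, so `head_dim=128` is an assumption for these models, and `Preset` and `kv_cache_bytes` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Preset:
    layers: int
    kv_heads: int
    head_dim: int  # assumed value; not listed in the presets table

PRESETS = {
    "Llama 3 8B": Preset(layers=32, kv_heads=8, head_dim=128),
    "Llama 3 70B": Preset(layers=80, kv_heads=8, head_dim=128),
    "Mistral 7B": Preset(layers=32, kv_heads=8, head_dim=128),
}

def kv_cache_bytes(p, context, batch=1, bits=16):
    # 2 (keys and values) x layers x kv_heads x head_dim x context x batch x bits/8
    return 2 * p.layers * p.kv_heads * p.head_dim * context * batch * bits / 8

preset = PRESETS["Llama 3 8B"]
fp16 = kv_cache_bytes(preset, context=8192)
q3 = kv_cache_bytes(preset, context=8192, bits=3)
print(f"Llama 3 8B, 8K context: FP16 {fp16 / 2**30:.2f} GiB, 3-bit {q3 / 2**30:.2f} GiB")
```

The 3-bit figure counts only the quantized codes; the sign sketch and any per-vector scalars add a small amount on top.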

© 2026 TurboQuant Guide — Community resource. Not affiliated with Google LLC.