TurboQuant Explained: How Random Rotation Beats 20 Years of VQ
2026-04-01
TurboQuant changes how KV cache quantization is approached. Prior methods typically require offline calibration data and model-specific tuning; TurboQuant is data-oblivious and needs neither.
The Core Insight
The key insight is that high-dimensional vector quantization can be reduced to independent 1D scalar quantization problems. Applying a random orthogonal rotation turns correlated KV vectors into vectors whose components are approximately independent and each follow a known statistical distribution, so a single fixed scalar codebook can be reused for every coordinate.
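To make the effect of the rotation concrete, here is a minimal NumPy sketch. It assumes the rotation is drawn by QR-factorizing a Gaussian matrix, which is one common way to sample a random orthogonal matrix; the post does not say how `random_orthogonal_matrix` is implemented, so the helper below is illustrative only. The input is a deliberately bad case for per-coordinate quantization: a vector with all of its energy in one coordinate.

```python
import numpy as np

def random_orthogonal_matrix(d, seed=0):
    """Illustrative helper (an assumption, not the post's implementation):
    sample an orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs so the result does not depend on QR's sign convention.
    return Q * np.sign(np.diag(R))

d = 128
R = random_orthogonal_matrix(d)

# Worst case for per-coordinate quantization: all energy in one coordinate.
x = np.zeros(d)
x[0] = 1.0
x_rot = x @ R.T

norm_before = np.linalg.norm(x)
norm_after = np.linalg.norm(x_rot)
print("norm before / after:", norm_before, norm_after)
print("largest |coordinate| before / after:", np.abs(x).max(), np.abs(x_rot).max())
print("coordinate variance after vs 1/d:", x_rot.var(), 1.0 / d)
```

The rotation leaves the norm unchanged but spreads the energy across all coordinates, each with variance close to 1/d. That predictable per-coordinate spread is what lets a fixed scalar codebook be chosen without looking at the data.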
def turboquant_encode(x, bits=3):
    # Step 1: Generate random orthogonal matrix
    R = random_orthogonal_matrix(x.shape[-1])
    # Step 2: Apply rotation to decorrelate
    x_rotated = x @ R.T
    # Step 3: Scalar quantize each dimension
    codebook = get_optimal_codebook(bits)
    x_quantized = scalar_quantize(x_rotated, codebook)
    return x_quantized, R
Why 3-bit Works
Traditional uniform quantization degrades significantly below 4 bits: its distortion shrinks at best as O(4^-b) with the bit width b, but with a constant factor that leaves it far from the Shannon limit. TurboQuant's PolarQuant achieves near-Shannon-optimal distortion for scalar quantization, so it continues to perform well even at 3 and 3.5 bits.
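The gap between a uniform grid and a distribution-matched codebook can be seen in a small experiment. The sketch below is an assumption-laden stand-in, not PolarQuant itself: it uses standard Gaussian samples in place of rotated coordinates and fits the codebook with Lloyd's algorithm (Lloyd-Max), a classical way to build an MSE-optimized scalar quantizer. The names `quantize` and `lloyd_max` are hypothetical helpers introduced here.

```python
import numpy as np

def quantize(x, levels):
    """Map each value to its nearest level; return reconstruction and indices."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx], idx

def lloyd_max(x, levels, iters=50):
    """Lloyd's algorithm in 1D, starting from the given levels."""
    levels = levels.copy()
    for _ in range(iters):
        _, idx = quantize(x, levels)
        for k in range(len(levels)):
            members = x[idx == k]
            if members.size:  # leave empty cells where they are
                levels[k] = members.mean()
    return levels

rng = np.random.default_rng(0)
x = rng.standard_normal(20_000)  # stand-in for rotated coordinates
bits = 3
n_levels = 2 ** bits

# Uniform grid: bin centers spanning the observed min/max range.
edges = np.linspace(x.min(), x.max(), n_levels + 1)
uniform_levels = 0.5 * (edges[:-1] + edges[1:])
tuned_levels = lloyd_max(x, uniform_levels)

mse_uniform = np.mean((x - quantize(x, uniform_levels)[0]) ** 2)
mse_tuned = np.mean((x - quantize(x, tuned_levels)[0]) ** 2)
print(f"{bits}-bit uniform MSE:   {mse_uniform:.4f}")
print(f"{bits}-bit Lloyd-Max MSE: {mse_tuned:.4f}")
```

Because Lloyd's algorithm starts from the uniform levels and never increases the error on the data it is fitted to, the tuned codebook cannot do worse than the uniform grid here. It concentrates levels near zero, where most of the probability mass sits.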
QJL: The Missing Piece
The Quantized Johnson-Lindenstrauss (QJL) correction adds a tiny 1-bit sketch of the quantization residual that removes systematic bias from inner product estimation. This is critical because attention scores depend on the inner product between queries and keys, so any bias directly degrades model quality.
import numpy as np

def qjl_correction(residual, m=64):
    # Random projection matrix W ∈ ℝ^(m×d)
    W = random_projection_matrix(m, residual.shape[-1])
    # 1-bit sketch: keep only the sign of each projection
    sketch = np.sign(W @ residual)
    return sketch  # m signs, i.e. only m bits once packed
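A sign sketch is only useful together with an estimator that turns it back into an inner product. The sketch below shows one standard form, under assumptions the post does not spell out: the projection matrix has i.i.d. standard Gaussian entries, the query is left unquantized, and the residual's norm is stored as one extra scalar next to the sign bits. For a Gaussian row g, E[sign(g·r)(g·q)] = sqrt(2/π)·⟨r, q⟩/‖r‖, so rescaling the average by sqrt(π/2)·‖r‖ gives an unbiased estimate of ⟨r, q⟩.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 20_000  # m is large here only to make the averaging visible

# Unit-norm stand-ins for a quantization residual and a query vector.
residual = rng.standard_normal(d)
residual /= np.linalg.norm(residual)
query = rng.standard_normal(d)
query /= np.linalg.norm(query)

W = rng.standard_normal((m, d))       # assumed Gaussian projection, rows g_i ~ N(0, I)
sketch = np.sign(W @ residual)        # the stored 1-bit sketch
res_norm = np.linalg.norm(residual)   # one extra scalar kept with the sketch

# Rescale by sqrt(pi/2) * ||r|| to undo the shrinkage caused by taking signs.
estimate = np.sqrt(np.pi / 2) * res_norm * np.mean(sketch * (W @ query))
exact = residual @ query
print("exact inner product:", exact)
print("sketch estimate:    ", estimate)
```

With a small sketch such as m=64, each individual estimate is noisy. What matters for attention is that the error has zero mean rather than a consistent offset.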