TurboQuant Explained: How Random Rotation Beats 20 Years of VQ

2026-04-01

TurboQuant represents a fundamental shift in how we approach KV cache quantization. Unlike many prior methods, which require offline calibration data and model-specific tuning, TurboQuant is data-oblivious: the same procedure applies to any model without looking at its activations first.

The Core Insight

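The key insight is that high-dimensional vector quantization can be reduced to independent 1D scalar quantization problems. Applying a random orthogonal rotation spreads the energy of each (normalized) KV vector evenly across its coordinates, so every coordinate follows the same known distribution (a Beta distribution that is close to Gaussian in high dimensions) and distinct coordinates are nearly independent. Because that distribution is known in advance, a single precomputed codebook can serve every coordinate of every vector.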

python

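import numpy as np

def random_orthogonal_matrix(d, seed=0):
    # QR of a Gaussian matrix yields a random orthogonal matrix; a fixed seed
    # lets the decoder regenerate the same R instead of storing it.
    q, r = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def turboquant_encode(x, bits=3, seed=0):
    # Step 1: Generate the random orthogonal matrix (shared by all vectors)
    R = random_orthogonal_matrix(x.shape[-1], seed)

    # Step 2: Apply rotation to decorrelate
    x_rotated = x @ R.T

    # Step 3: Scalar quantize each dimension to its nearest codebook level.
    # get_optimal_codebook is not defined in this post; it is assumed to return
    # a 1D array of 2**bits levels fitted to the rotated distribution.
    codebook = get_optimal_codebook(bits)
    x_quantized = np.abs(x_rotated[..., None] - codebook).argmin(axis=-1)

    return x_quantized, R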
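
To make the encoder above concrete, here is a minimal round-trip sketch in NumPy. It is an illustration under stated assumptions, not the reference implementation: `fit_codebook` is a hypothetical stand-in for `get_optimal_codebook` (plain 1D Lloyd's algorithm run on coordinates of random unit vectors), vectors are normalized before rotation with their norms stored separately, and `encode`/`decode` are names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits = 64, 3

def random_orthogonal_matrix(d, rng):
    # QR of a Gaussian matrix; the sign fix makes Q uniformly (Haar) distributed.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def fit_codebook(samples, bits, iters=50):
    # 1D Lloyd's algorithm: alternate nearest-level assignment and centroid update.
    levels = np.quantile(samples, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        idx = np.abs(samples[:, None] - levels).argmin(axis=1)
        for k in range(levels.size):
            if np.any(idx == k):
                levels[k] = samples[idx == k].mean()
    return levels

# Data-oblivious codebook: fitted to coordinates of random unit vectors, never to x.
sphere = rng.standard_normal((512, d))
sphere /= np.linalg.norm(sphere, axis=1, keepdims=True)
codebook = fit_codebook(sphere.ravel(), bits)

def encode(x, R, codebook):
    norms = np.linalg.norm(x, axis=-1, keepdims=True)  # kept separately (assumption)
    rotated = (x / norms) @ R.T
    return np.abs(rotated[..., None] - codebook).argmin(axis=-1), norms

def decode(idx, norms, R, codebook):
    # R is orthogonal, so multiplying by R undoes the earlier multiplication by R.T.
    return (codebook[idx] @ R) * norms

# Strongly correlated test data with a few large "outlier" channels.
x = rng.standard_normal((256, 8)) @ rng.standard_normal((8, d))
x[:, :4] *= 20.0

R = random_orthogonal_matrix(d, rng)
idx, norms = encode(x, R, codebook)
x_hat = decode(idx, norms, R, codebook)
rel_mse = np.sum((x - x_hat) ** 2) / np.sum(x ** 2)
print(f"relative MSE at {bits} bits: {rel_mse:.4f}")
```

The codebook never sees `x`, yet the reconstruction error stays small even though the test data is low-rank and dominated by a handful of channels. That is the practical meaning of "data-oblivious" here: the rotation, not calibration, makes the data fit the codebook.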

Why 3-bit Works

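Traditional uniform quantization degrades significantly below 4 bits because its evenly spaced levels, scaled to the largest value, are poorly matched to how the values are actually distributed, so much of the range is spent on rare outliers. The information-theoretic (Shannon) lower bound on distortion shrinks as 4^-b for b bits per coordinate, and uniform quantization sits far above it at low bit widths. TurboQuant's PolarQuant stays within a small constant factor of that bound for scalar quantization, meaning it continues to perform well even at 3-bit and 3.5-bit.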
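
The gap is easy to measure. The sketch below compares two scalar quantizers on unit-variance Gaussian samples, used here as an assumed stand-in for one rotated coordinate: an absmax-scaled uniform quantizer and a codebook fitted with 1D Lloyd's algorithm. Both are set against 4^-b, the Shannon distortion-rate value for a unit-variance Gaussian source. The function names are illustrative, and the Lloyd codebook is a generic optimal scalar quantizer rather than TurboQuant's exact codebook.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)  # unit-variance stand-in for one rotated coordinate

def uniform_mse(x, bits):
    # Absmax uniform quantizer: 2**bits equal-width cells across [-max|x|, max|x|],
    # each reconstructed at its midpoint.
    levels = 2**bits
    hi = np.abs(x).max()
    step = 2 * hi / levels
    idx = np.clip(np.floor((x + hi) / step), 0, levels - 1)
    x_hat = -hi + (idx + 0.5) * step
    return np.mean((x - x_hat) ** 2)

def lloyd_max_mse(x, bits, iters=100):
    # 1D Lloyd's algorithm: cell edges are midpoints between levels,
    # and each level moves to the mean of the samples in its cell.
    levels = np.quantile(x, (np.arange(2**bits) + 0.5) / 2**bits)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2
        idx = np.searchsorted(edges, x)
        sums = np.bincount(idx, weights=x, minlength=levels.size)
        counts = np.bincount(idx, minlength=levels.size)
        levels = np.where(counts > 0, sums / np.maximum(counts, 1), levels)
    idx = np.searchsorted((levels[:-1] + levels[1:]) / 2, x)
    return np.mean((x - levels[idx]) ** 2)

results = {}
print("bits  uniform  lloyd-max  4^-b")
for bits in (2, 3, 4):
    results[bits] = (uniform_mse(x, bits), lloyd_max_mse(x, bits), 4.0 ** -bits)
    print(bits, *(f"{v:.4f}" for v in results[bits]))
```

Both quantizers improve as bits are added, but the distribution-matched codebook stays within a small multiple of the bound, while the absmax-scaled uniform grid wastes most of its range on the tails.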

QJL: The Missing Piece

The Quantized Johnson-Lindenstrauss (QJL) correction adds a tiny 1-bit sketch of the residual left over after scalar quantization, which removes systematic bias from inner product estimation. This is critical because attention scores depend on the inner product between queries and keys, so any bias directly degrades model quality.

python

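import numpy as np

def qjl_correction(residual, m=64, seed=0):
    # Random Gaussian projection matrix W ∈ ℝ^(m×d); a fixed seed lets the
    # decoder regenerate W instead of storing it
    W = np.random.default_rng(seed).standard_normal((m, residual.shape[-1]))

    # 1-bit sketch: just signs (works for one vector or a batch of row vectors)
    sketch = np.where(residual @ W.T >= 0, 1, -1).astype(np.int8)

    # m sign bits per vector; the residual's norm is kept alongside for rescaling
    return sketch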
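
The sketch is only useful together with an estimator that turns sign bits back into an inner product. The example below shows one way to do that, assuming a Gaussian projection matrix and a stored residual norm: for a Gaussian row w, the expectation of sign(w·r)(w·q) equals sqrt(2/π)·⟨q, r⟩/‖r‖, so rescaling the average by sqrt(π/2)·‖r‖ gives an unbiased estimate of ⟨q, r⟩. The vectors `r` and `q` are synthetic stand-ins, and `m` is set far larger than the article's 64 purely so that the convergence is visible in a single run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 32, 50_000  # m is deliberately huge here; a real sketch would use far fewer bits

r = 0.1 * rng.standard_normal(d)  # stand-in for a quantization residual
q = rng.standard_normal(d)        # stand-in for a query vector

W = rng.standard_normal((m, d))             # Gaussian projection matrix
sketch = np.where(W @ r >= 0, 1.0, -1.0)    # the stored 1-bit sketch of r
r_norm = np.linalg.norm(r)                  # one extra scalar stored with the sketch

# The query is projected at full precision; only the residual side is 1-bit.
estimate = np.sqrt(np.pi / 2) * r_norm * np.mean(sketch * (W @ q))
exact = q @ r

print(f"exact    <q, r> = {exact:+.4f}")
print(f"estimate <q, r> = {estimate:+.4f}")
```

With a small `m` each individual estimate is noisy, but the noise is zero-mean. Adding this term to the inner product computed from the scalar-quantized vector corrects the bias on average without storing the residual itself.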
