KV Cache Quantization 101: What Every ML Engineer Should Know

2026-03-05

As LLMs grow larger and context windows longer, the KV cache has become one of the main memory bottlenecks for efficient inference. This article explains why.

What is the KV Cache?

During autoregressive generation, the attention keys and values computed for each token are cached so they do not have to be recomputed at every step. For Llama 3 70B at 128K context with batch=8, this cache takes about 320 GiB in FP16, which is larger than the model weights themselves.

python

# KV cache size calculation
def calculate_kv_cache_size(model_params, context, batch, bits=16):
    """Return the KV cache size in GiB for a given model config and workload."""
    layers = model_params['num_hidden_layers']
    kv_heads = model_params['num_key_value_heads']
    head_dim = model_params['hidden_size'] // model_params['num_attention_heads']

    # The factor of 2 covers keys and values; bits / 8 is bytes per element
    size_bytes = 2 * layers * kv_heads * head_dim * context * batch * (bits / 8)
    return size_bytes / (1024**3)  # Convert bytes to GiB

# Llama 3 70B @ 128K context, batch=8, FP16
vram = calculate_kv_cache_size(
    {'num_hidden_layers': 80, 'num_key_value_heads': 8,
     'hidden_size': 8192, 'num_attention_heads': 64},
    context=131072, batch=8, bits=16
)
print(f"KV Cache VRAM: {vram:.1f} GiB")  # 320.0 GiB
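The same arithmetic can be viewed per token, which makes it easier to see how quickly a memory budget is used up. The sketch below is illustrative only: the helper names and the 40 GiB budget are hypothetical, and the layer, head, and dimension values are the Llama 3 70B figures used above.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits=16):
    """Bytes of KV cache added for each token of each sequence."""
    # 2 tensors (K and V) per layer, each holding kv_heads * head_dim elements
    return 2 * layers * kv_heads * head_dim * bits // 8


def max_cached_tokens(budget_gib, layers, kv_heads, head_dim, bits=16):
    """Total tokens (summed over the batch) that fit in budget_gib of memory."""
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, bits)
    return int(budget_gib * 1024**3) // per_token


# Llama 3 70B shape: 80 layers, 8 KV heads, head_dim = 8192 // 64 = 128
per_token = kv_bytes_per_token(80, 8, 128)
print(f"{per_token / 1024:.1f} KiB per token")  # 320.0 KiB per token

# Hypothetical 40 GiB budget reserved for the cache, FP16
print(max_cached_tokens(40, 80, 8, 128))  # 131072
```

At 320 KiB per token, a single 128K-token sequence already fills the assumed 40 GiB budget, so batch size and context length trade off directly against each other.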

Why Quantization Works

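By reducing the precision of cached values from 16 bits to 3 bits, the cache shrinks by up to 16/3 ≈ 5.3x, and so does the volume of KV data read from memory at each decoding step. Because decoding at long context is typically limited by memory bandwidth rather than compute, this can translate into faster token generation, and it requires no retraining or changes to the model weights. In practice the saving is somewhat below 5.3x, since quantization scales and zero-points have to be stored alongside the quantized values.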
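To make the idea concrete, here is a minimal sketch of group-wise asymmetric uniform quantization in plain Python. It is a generic textbook scheme, not the method of any particular library or paper. The group size of 64 and the FP16 scale and zero-point per group (32 bits of metadata) are assumptions chosen for illustration, and the function names are hypothetical.

```python
def effective_bits(bits, group_size, meta_bits=32):
    """Average bits per cached element, including per-group metadata.

    meta_bits=32 assumes one FP16 scale and one FP16 zero-point per group.
    """
    return bits + meta_bits / group_size


def quantize_group(values, bits=3):
    """Map one group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1            # 7 steps for 3 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for constant groups
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Reconstruct approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


group = [-1.0, -0.5, 0.05, 0.25, 0.5, 0.75, 1.0, 0.1]
codes, scale, lo = quantize_group(group, bits=3)
restored = dequantize_group(codes, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(codes)
print(f"max abs error {max_err:.3f}, bound {scale / 2:.3f}")
print(f"ideal ratio: {16 / 3:.2f}x")                            # 5.33x
print(f"with metadata: {16 / effective_bits(3, 64):.2f}x")      # 4.57x
```

The rounding error per element is bounded by half a quantization step, which is why smaller groups (tighter min and max) improve accuracy. Smaller groups also carry more metadata per element, which pulls the real compression ratio further below the ideal 5.33x.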
