Fundamentals · 5 min read
KV Cache Quantization 101: What Every ML Engineer Should Know
2026-03-05
As LLMs grow larger and context windows longer, the KV cache has become the primary bottleneck for efficient inference. This article explains why.
What is the KV Cache?
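During autoregressive generation, each token's attention keys and values are cached so that later decoding steps can reuse them instead of recomputing them. The cache grows linearly with context length and batch size. For Llama 3 70B at 128K context with batch=8, this cache exceeds 300 GB in FP16, more than twice the size of the model's own FP16 weights (roughly 140 GB).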
python
# KV cache size calculation
def calculate_kv_cache_size(model_params, context, batch, bits=16):
    """Return the KV cache size in GiB for the given model config."""
    layers = model_params['num_hidden_layers']
    kv_heads = model_params['num_key_value_heads']
    head_dim = model_params['hidden_size'] // model_params['num_attention_heads']
    # Factor of 2 covers keys and values, one entry per layer, KV head, and token
    size_bytes = 2 * layers * kv_heads * head_dim * context * batch * (bits / 8)
    return size_bytes / (1024**3)  # Convert bytes to GiB

# Llama 3 70B @ 128K context, batch=8, FP16
vram = calculate_kv_cache_size(
    {'num_hidden_layers': 80, 'num_key_value_heads': 8,
     'hidden_size': 8192, 'num_attention_heads': 64},
    context=131072, batch=8, bits=16
)
print(f"KV Cache VRAM: {vram:.1f} GiB")  # 320.0 GiB (about 344 GB)

Why Quantization Works
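Reducing the precision of cached values from 16 bits to 3 bits shrinks the cache, and the memory traffic needed to read it, by a factor of 16/3, or about 5.3x, before counting the small overhead of quantization metadata such as scales and offsets. Because decoding is typically limited by memory bandwidth rather than compute, reading fewer bytes per generated token translates into faster generation. Quantization is applied to the cache at inference time, so it requires no model modification or retraining.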
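To make the 16-bit to 3-bit step concrete, here is a minimal sketch of group-wise asymmetric quantization in plain Python. It is an illustration under stated assumptions, not the method of any particular library: the function names (`quantize_group`, `dequantize_group`, `effective_bits`), the group size of 64, and the FP16 scale and offset stored per group are all hypothetical choices made for this example. Production KV cache quantizers differ in how they group values and pack bits.

```python
import random

def quantize_group(values, bits=3):
    """Asymmetric uniform quantization of one group of floats to integer codes."""
    levels = (1 << bits) - 1                 # 3 bits -> codes 0..7
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [c * scale + lo for c in codes]

def effective_bits(bits=3, group_size=64, meta_bits=32):
    """Bits stored per value once per-group metadata is counted.

    meta_bits=32 assumes one FP16 scale plus one FP16 offset per group
    (an assumption for this sketch).
    """
    return bits + meta_bits / group_size

random.seed(0)
group = [random.gauss(0.0, 1.0) for _ in range(64)]  # stand-in for 64 cached key values
codes, scale, lo = quantize_group(group, bits=3)
restored = dequantize_group(codes, scale, lo)

# Rounding to the nearest level bounds the error by half a quantization step
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"max abs error: {max_err:.4f} (bound: {scale / 2:.4f})")
print(f"ideal ratio: {16 / 3:.2f}x")                      # 5.33x
print(f"effective ratio: {16 / effective_bits():.2f}x")   # 4.57x
```

Under these assumptions the metadata adds half a bit per value, so the realized compression is closer to 4.6x than the ideal 5.3x. Larger groups reduce that overhead at the cost of a coarser scale per group.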