Side-by-side comparison of KV cache quantization methods
| Algorithm | Bits | ΔPPL vs. baseline (↓) | Memory reduction | Speedup | Online | No training |
|---|---|---|---|---|---|---|
| TurboQuant | 3-bit | ~baseline | ~4x | ~2x | ✓ | ✓ |
| TurboQuant | 2.5-bit | +0.1 | ~6x | ~3x | ✓ | ✓ |
| GPTQ | 4-bit | +0.3 | ~4x | ~1.5x | ✗ | ✗ |
| AWQ | 4-bit | +0.2 | ~4x | ~1.5x | ✗ | ✗ |
| KVQuant | 4-bit | +0.15 | ~4x | ~1x | ✗ | ✗ |
| FP8 | 8-bit | +0.05 | ~2x | ~1.2x | ✓ | ✓ |
Sources: arXiv:2504.19874 (Google Research) • 0xSero/turboquant implementation • Public benchmarks
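The memory figures in the table can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming an FP16 (16-bit) baseline and a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 32,768 tokens) that does not come from the sources above. It computes ideal ratios only; real implementations typically store extra metadata such as scales, so measured ratios can be lower.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Ideal KV cache size in bytes: keys and values for every layer, head, and token."""
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return num_values * bits_per_value / 8


def compression_ratio(bits_per_value, baseline_bits=16):
    """Ideal size reduction factor relative to a baseline precision (FP16 by default)."""
    return baseline_bits / bits_per_value


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)
    for bits in (16, 8, 4, 3.5, 3, 2.5):
        gib = kv_cache_bytes(bits_per_value=bits, **shape) / 2**30
        ratio = compression_ratio(bits)
        saved = 1 - 1 / ratio  # fraction of baseline memory no longer needed
        print(f"{bits:>4} bits: {gib:6.2f} GiB  {ratio:4.2f}x  {saved:.1%} saved")
```

Under these assumptions the FP16 cache occupies 4 GiB, and the ideal ratio depends only on the bit width, not on the model shape.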
TurboQuant consistently outperforms GPTQ and KIVI across all evaluated configurations. At 3.5-bit, TurboQuant achieves near-lossless quality (PPL within 1% of FP16) while reducing KV cache memory by close to 80% (a 3.5-bit entry is about 22% the size of a 16-bit one). At 3-bit, the reported 8x attention speedup on H100 GPUs comes from reduced HBM bandwidth pressure, the primary bottleneck in long-context inference. Unlike GPTQ, which requires offline calibration data, TurboQuant operates online with zero training overhead.
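To make the online versus offline distinction concrete, here is a minimal sketch of a calibration-free cache that quantizes each vector as it arrives, using only that vector's own value range. This is generic per-vector round-to-nearest quantization, not TurboQuant's actual algorithm; the names `OnlineKVCache`, `quantize_vector`, and `dequantize_vector` are hypothetical and exist only for this illustration.

```python
import random


def quantize_vector(vec, bits):
    """Round-to-nearest uniform quantization using only this vector's own range."""
    levels = 2 ** bits - 1
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels if hi > lo else 1.0  # avoid division by zero
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale


def dequantize_vector(codes, lo, scale):
    """Map integer codes back to approximate floating-point values."""
    return [lo + c * scale for c in codes]


class OnlineKVCache:
    """Hypothetical cache: quantizes each vector on arrival, with no calibration pass."""

    def __init__(self, bits):
        self.bits = bits
        self.entries = []  # one (codes, lo, scale) tuple per stored vector

    def append(self, vec):
        self.entries.append(quantize_vector(vec, self.bits))

    def get(self, index):
        return dequantize_vector(*self.entries[index])


if __name__ == "__main__":
    random.seed(0)
    cache = OnlineKVCache(bits=3)
    key = [random.gauss(0.0, 1.0) for _ in range(64)]
    cache.append(key)  # no dataset or training step is needed beforehand
    restored = cache.get(0)
    print(max(abs(a - b) for a, b in zip(key, restored)))
```

Nothing here depends on data seen before inference starts, which is the property the Online and No Training columns describe. An offline method would instead fit its quantization parameters on a calibration set ahead of time.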