Performance comparison across quantization methods
| Model | Context | Method | Bits | VRAM (% of FP16) | PPL (lower is better) | Speed (relative) |
|---|---|---|---|---|---|---|
| Llama-3-70B | 8K | FP16 | 16 | 100% | 3.12 | 1.0x |
| Llama-3-70B | 8K | GPTQ-4bit | 4 | 25% | 3.89 | 2.1x |
| Llama-3-70B | 8K | KIVI | 3 | 18.75% | 3.45 | 3.2x |
| Llama-3-70B | 8K | TurboQuant 3.5-bit | 3.5 | 21.875% | 3.15 | 7.8x |
| Llama-3-70B | 8K | TurboQuant 3-bit | 3 | 18.75% | 3.21 | 8.0x |
| Llama-3-8B | 32K | FP16 | 16 | 100% | 5.21 | 1.0x |
| Llama-3-8B | 32K | INT8 | 8 | 50% | 5.34 | 2.8x |
| Llama-3-8B | 32K | TurboQuant 3-bit | 3 | 18.75% | 5.29 | 7.5x |
Sources: arXiv:2504.19874 (Google Research) • 0xSero/turboquant implementation • Public benchmarks
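The VRAM column above is simple bit-width arithmetic: quantized bits divided by the 16-bit FP16 baseline. A quick sketch of that calculation (note: this ignores the small per-group scale/zero-point overhead that real quantization schemes add, so actual footprints run slightly higher):

```python
def vram_fraction(bits: float, baseline_bits: int = 16) -> float:
    """Fraction of FP16 memory used by a `bits`-wide representation."""
    return bits / baseline_bits

# 8-bit -> 50%, 4-bit -> 25%, 3.5-bit -> 21.875%, 3-bit -> 18.75%
for bits in (8, 4, 3.5, 3):
    print(f"{bits}-bit: {vram_fraction(bits):.4%} of FP16")
```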
Highlights:
- 6x VRAM reduction at 3-bit
- 8x attention speedup on H100
- <1% PPL degradation at 3.5-bit
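To make the low-bit round trip concrete, here is a minimal symmetric 3-bit quantizer in pure Python. This is not the TurboQuant algorithm (arXiv:2504.19874 describes a more elaborate scheme); it only illustrates the basic quantize/dequantize step that any low-bit KV-cache method builds on, and why reconstruction error stays bounded by half the scale:

```python
def quantize_3bit(values):
    """Map floats to signed 3-bit integers in [-4, 3] with one shared scale.

    Per-tensor symmetric quantization: scale is chosen so the largest
    magnitude maps near the edge of the 3-bit signed range.
    """
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 4.0  # 3 signed bits -> integer levels -4..3
    q = [max(-4, min(3, round(v / scale))) for v in values]
    return q, scale

def dequantize_3bit(q, scale):
    """Reconstruct approximate floats from 3-bit codes and the scale."""
    return [x * scale for x in q]

vals = [0.9, -0.3, 0.05, -1.2]
codes, scale = quantize_3bit(vals)
recon = dequantize_3bit(codes, scale)
```

Real KV-cache quantizers differ mainly in how they pick the scale (per-channel or per-group rather than per-tensor) and how they handle outliers, which is where the gap between a 3.89 and a 3.15 PPL at similar bit widths comes from.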