Q4 vs Q5 vs Q8: which quantization for Llama 70B in 2026?
Damien · LocalIA
VRAM table per quant (Q3 to FP16), measured quality loss, per-GPU recommendations, and estimated tok/s.

The table that settles it
| Quant | VRAM (70B) | Perplexity increase vs FP16 | Quality in practice |
|---|---|---|---|
| Q3_K_M | ~32 GB | +5-8% | Noticeable: grammar OK, reasoning degraded |
| Q4_K_M | ~44 GB | +2-3% | Good: consumer sweet spot |
| Q5_K_M | ~52 GB | +0.8-1.2% | Very good: indistinguishable in chat |
| Q6_K | ~60 GB | +0.3-0.5% | Near-perfect |
| Q8_0 | ~78 GB | +0.05-0.1% | Indistinguishable from FP16 |
| FP16 | ~146 GB | 0 | Reference, datacenter only |
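If you want to sanity-check these figures for another model size or context length, a back-of-the-envelope estimate (weights at the quant's average bits per weight, plus KV cache, plus runtime overhead) gets you close. The bits-per-weight values and the KV-cache figure in this sketch are assumptions, not exact GGUF sizes:

```python
# Rough VRAM estimate: weights + KV cache + runtime overhead.
# BITS_PER_WEIGHT values are approximate GGUF averages (assumed, not exact);
# kv_gb_per_8k ~2.5 GB is roughly what a 70B Llama-style model with GQA
# needs for an fp16 KV cache at 8k tokens.

BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q5_K_M": 5.7,
    "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0,
}

def estimate_vram_gb(params_b: float, quant: str,
                     ctx: int = 8192,
                     kv_gb_per_8k: float = 2.5,
                     overhead_gb: float = 1.5) -> float:
    """Ballpark VRAM need in GB for a dense model of `params_b` billion parameters."""
    weights_gb = params_b * BITS_PER_WEIGHT[quant] / 8  # G-params * bits / 8 = GB
    kv_gb = kv_gb_per_8k * ctx / 8192                   # KV cache scales with context
    return weights_gb + kv_gb + overhead_gb

for q in BITS_PER_WEIGHT:
    print(f"{q:7s} ~{estimate_vram_gb(70, q):5.0f} GB")
```

The output won't match the table to the gigabyte (quant mixes vary slightly between GGUF releases), but it is close enough for sizing a GPU purchase.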
Which quant for which GPU?
- 24 GB (RTX 4090): Llama 70B does not fit on a single GPU at any reasonable quant. Pick a 27-32B model instead.
- 32 GB (RTX 5090): Llama 70B Q3_K_M just fits (~30 GB). Quality slightly degraded.
- 48 GB (A6000, 6000 Ada): Llama 70B Q4_K_M fits with margin. Realistic minimum for single-GPU 70B.
- 64 GB (2× 5090): Llama 70B Q5_K_M or Q6_K, near-FP16 quality.
- 96 GB+ (2× A6000 NVLink, H100, MI300X): Q8 or FP16 for production-critical workloads.
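The same cutoffs as a tiny helper, handy if you script your hardware sizing. The thresholds below are the ones from this article, not a universal rule:

```python
def pick_quant_for_70b(vram_gb: float) -> str:
    """Map a total VRAM budget (single GPU or pooled across GPUs) to the quant suggested above."""
    if vram_gb >= 96:
        return "Q8_0 or FP16 (production-critical workloads)"
    if vram_gb >= 64:
        return "Q5_K_M or Q6_K (near-FP16 quality)"
    if vram_gb >= 48:
        return "Q4_K_M (realistic single-GPU minimum)"
    if vram_gb >= 32:
        return "Q3_K_M (just fits, quality slightly degraded)"
    return "70B does not fit; pick a 27-32B model instead"

print(pick_quant_for_70b(64))  # -> Q5_K_M or Q6_K (near-FP16 quality)
```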
Speed tradeoff (single user on A6000)
| Quant | Throughput | Note |
|---|---|---|
| Q3_K_M | ~17-19 tok/s | Fewer bytes to stream |
| Q4_K_M | ~14-16 tok/s | Speed/quality sweet spot |
| Q5_K_M | ~12-14 tok/s | +10% bytes vs Q4 |
| Q8_0 | ~9-11 tok/s | 2× the weight |
| FP16 | ~5-6 tok/s | 4× the weight |
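These numbers follow almost directly from memory bandwidth: during decoding, each generated token streams roughly the whole weight file through the GPU, so bandwidth divided by model size gives a ceiling. A rough sketch, assuming the RTX A6000's ~768 GB/s spec and ignoring KV-cache reads and kernel overhead (which is why measured throughput lands 10-30% lower):

```python
# Crude roofline for single-stream decoding: tok/s <= bandwidth / bytes read per token.
# Assumes decoding is memory-bandwidth-bound and that each token reads the full
# weight file once; KV cache and kernel overhead push real numbers lower.

A6000_BANDWIDTH_GBPS = 768.0  # RTX A6000 memory bandwidth, GB/s

def max_tok_per_s(model_size_gb: float,
                  bandwidth_gbps: float = A6000_BANDWIDTH_GBPS) -> float:
    return bandwidth_gbps / model_size_gb

for quant, size_gb in [("Q3_K_M", 32), ("Q4_K_M", 44), ("Q5_K_M", 52),
                       ("Q8_0", 78), ("FP16", 146)]:
    print(f"{quant:7s} <= {max_tok_per_s(size_gb):4.1f} tok/s (theoretical ceiling)")
```

Q4_K_M comes out around 17 tok/s theoretical against the ~14-16 measured, which is about the expected gap.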
Default 2026 combo: Llama 3.3 70B Q5_K_M on a 2× RTX 5090 rig (64 GB VRAM). Quality indistinguishable from FP16 in chat/RAG, 12-15 tok/s per stream, 32k context usable without saturating VRAM.
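As a concrete starting point, here is a minimal sketch of that combo with llama-cpp-python (CUDA build assumed). The GGUF filename is hypothetical, and the even tensor_split assumes both 5090s are otherwise idle:

```python
from llama_cpp import Llama

# ~52 GB of Q5_K_M weights split evenly across two 32 GB RTX 5090s,
# with the 32k context mentioned above.
llm = Llama(
    model_path="llama-3.3-70b-instruct-Q5_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,          # offload all layers to GPU
    tensor_split=[0.5, 0.5],  # even split across the two GPUs
    n_ctx=32768,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "One-line summary of Q5_K_M tradeoffs?"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```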
Open the calculator / request a quote with your target model, user count, and constraints.
Quantization · Llama · Optimization