Quantization · 8 min read

Q4 vs Q5 vs Q8: which quantization for Llama 70B in 2026?

Damien · LocalIA
Published 2026-05-12

VRAM table per quant (Q3 to FP16), measured quality loss, per-GPU recommendations, and estimated tok/s.


Translated article. This version is localized to avoid international pages with French text. Technical data, prices, and recommendations remain unchanged.

The table that settles it

Quant     VRAM (70B)   Perplexity vs FP16   Perceived quality
Q3_K_M    ~32 GB       +5-8%                Noticeable: grammar OK, reasoning degraded
Q4_K_M    ~44 GB       +2-3%                Good: consumer sweet spot
Q5_K_M    ~52 GB       +0.8-1.2%            Very good: indistinguishable in chat
Q6_K      ~60 GB       +0.3-0.5%            Near-perfect
Q8_0      ~78 GB       +0.05-0.1%           Indistinguishable from FP16
FP16      ~146 GB      0                    Reference, datacenter only
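
To make the perplexity column concrete, here is a minimal sketch of how such a percentage is usually derived: perplexity is the exponential of the mean negative log-likelihood per token, and the table reports the relative increase over the FP16 reference. The function names and the sample values below are illustrative assumptions, not measurements from this article.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

def relative_increase_pct(ppl_quant, ppl_fp16):
    """Percent perplexity increase of a quantized model over the FP16 reference."""
    return 100.0 * (ppl_quant - ppl_fp16) / ppl_fp16

# Illustrative numbers only (assumed, not taken from the article):
fp16_ppl = 5.00
q4_ppl = 5.12
print(f"{relative_increase_pct(q4_ppl, fp16_ppl):.1f}%")  # 2.4%
```

Both perplexities must come from the same evaluation text and the same context length, otherwise the percentage is not comparable.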

Which quant for which GPU?

  • 24 GB (RTX 4090): Llama 70B does not fit on a single GPU at any reasonable quant. Pick a 27-32B model instead.
  • 32 GB (RTX 5090): Llama 70B Q3_K_M just fits (~30 GB). Quality slightly degraded.
  • 48 GB (A6000, 6000 Ada): Llama 70B Q4_K_M fits with margin. Realistic minimum for single-GPU 70B.
  • 64 GB (2× 5090): Llama 70B Q5_K_M or Q6_K, near-FP16 quality.
  • 96 GB+ (2× A6000 NVLink, H100, MI300X): Q8 or FP16 for production-critical workloads.
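
The list above boils down to one rule: take the highest-quality quant whose footprint fits the VRAM you have. A minimal sketch, using the approximate figures from the table and treating pooled multi-GPU VRAM as a single budget (a simplification; `best_quant` and `headroom_gb` are hypothetical names introduced here):

```python
# VRAM figures copied from the article's table (Llama 70B, approximate GB),
# ordered from highest to lowest quality.
QUANT_VRAM_GB = [
    ("FP16", 146),
    ("Q8_0", 78),
    ("Q6_K", 60),
    ("Q5_K_M", 52),
    ("Q4_K_M", 44),
    ("Q3_K_M", 32),
]

def best_quant(total_vram_gb, headroom_gb=0.0):
    """Return the highest-quality quant whose footprint plus headroom fits, else None."""
    for name, need_gb in QUANT_VRAM_GB:
        if need_gb + headroom_gb <= total_vram_gb:
            return name
    return None

for vram in (24, 32, 48, 64, 96):
    print(vram, "GB ->", best_quant(vram))
```

Raising `headroom_gb` reserves room for long contexts or several concurrent users: with 8 GB of headroom, a 64 GB rig drops from Q6_K to Q5_K_M.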

Speed tradeoff (single user on A6000)

Quant     Speed (est.)    Note
Q3_K_M    ~17-19 tok/s    Fewer bytes to stream
Q4_K_M    ~14-16 tok/s    Speed/quality sweet spot
Q5_K_M    ~12-14 tok/s    ~18% more bytes than Q4_K_M (52 vs 44 GB)
Q8_0      ~9-11 tok/s     ~2× the bytes of a 4-bit quant
FP16      ~5-6 tok/s      ~4× the bytes of a 4-bit quant

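
Why does speed track size so closely? Single-user decoding is mostly memory-bound: every generated token streams the full set of weights once. A rough sketch of the resulting ceiling, assuming the RTX A6000's spec-sheet bandwidth of 768 GB/s and using the VRAM figures from the first table as a stand-in for bytes read per token (both are assumptions; KV-cache reads and multi-GPU splits are ignored, and `decode_ceiling_tok_s` is a hypothetical helper):

```python
A6000_BANDWIDTH_GB_S = 768.0  # RTX A6000 spec-sheet memory bandwidth (assumed here)

def decode_ceiling_tok_s(weights_gb, bandwidth_gb_s=A6000_BANDWIDTH_GB_S):
    """Upper bound on single-stream decode speed: each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

# Footprints in GB taken from the article's VRAM table.
for name, gb in [("Q3_K_M", 32), ("Q4_K_M", 44), ("Q5_K_M", 52), ("Q8_0", 78), ("FP16", 146)]:
    print(f"{name}: <= {decode_ceiling_tok_s(gb):.1f} tok/s")
```

Real throughput sits below this ceiling, since kernels never reach peak bandwidth and dequantization adds compute.
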
Default 2026 combo: Llama 3.3 70B Q5_K_M on a 2× RTX 5090 rig (64 GB VRAM). Quality indistinguishable from FP16 in chat/RAG, 12-15 tok/s per stream, 32k context usable without saturating VRAM.
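
The 32k-context claim can be sanity-checked with a back-of-the-envelope KV-cache estimate. This sketch assumes the published Llama 3 70B shape (80 layers, 8 KV heads via grouped-query attention, head dimension 128), an FP16 cache, and a single stream; `kv_cache_gb` is a hypothetical helper, and activation buffers are not counted.

```python
def kv_cache_gb(context_tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size in GB (1e9 bytes): keys + values for every layer and token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1e9

weights_gb = 52  # Q5_K_M figure from the article's table
cache_gb = kv_cache_gb(32_768)
print(f"KV cache: {cache_gb:.1f} GB, total: {weights_gb + cache_gb:.1f} GB of 64 GB")
# KV cache: 10.7 GB, total: 62.7 GB of 64 GB
```

Under these assumptions the fit is tight, and each extra concurrent stream needs its own cache, so multi-user setups may call for a quantized KV cache or a shorter context.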


Frequently asked questions

Which quantization for Llama 3.3 70B locally?
Q5_K_M is the production sweet spot: 52 GB VRAM, quality loss below 1.5% (perplexity), indistinguishable from FP16 in chat. Q4_K_M (44 GB) if VRAM is limited. Q8 (78 GB) only for critical production.
What is the quality difference between Q4 and Q8?
Q4_K_M: +2-3% perplexity vs FP16, noticeable in complex reasoning. Q5_K_M: +0.8-1.2%, indistinguishable in everyday chat. Q8: +0.05-0.1%, indistinguishable in benchmarks. For 95% of business cases, Q5_K_M is plenty.
Q4_K_M or Q4_K_S, which variant?
K_M (Medium) by default: a quality/size balance. K_S (Small): -5% VRAM, -1% quality (aggressive). K_L (Large): +5% VRAM, +0.5% quality (cautious). For the vast majority, K_M is the right pick.
Is FP16 really worth it versus Q5_K_M?
Not for local production. FP16 = 146 GB VRAM for 70B (datacenter only). For about 0.8% extra quality you nearly triple the VRAM (146 GB vs 52 GB). Q5_K_M on 2× RTX 5090 (64 GB) covers 99% of real needs.
How do you test a quantization's quality on your use case?
Compare Q4_K_M, Q5_K_M and Q8 on 20-30 prompts representative of your field. Criteria: coherence, factual accuracy, language quality. The perceived difference between Q5 and Q8 is usually small.
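
One way to run that comparison without bias is a blind rating sheet: collect one answer per quant per prompt, shuffle them, and only reveal which quant wrote what after scoring. A minimal sketch; the `generators` callables are placeholders for however you query your own runtime, and all names here are hypothetical.

```python
import random

def build_blind_sheet(prompts, generators, seed=0):
    """Collect one answer per quant per prompt, shuffled so raters cannot see the source.

    `generators` maps a quant label to a callable prompt -> answer text.
    Returns (sheet, key): sheet rows carry anonymous ids, key maps ids back to labels.
    """
    rng = random.Random(seed)
    sheet, key = [], {}
    for p_idx, prompt in enumerate(prompts):
        answers = [(label, gen(prompt)) for label, gen in generators.items()]
        rng.shuffle(answers)
        for slot, (label, text) in enumerate(answers):
            answer_id = f"p{p_idx}-{chr(ord('A') + slot)}"
            sheet.append({"id": answer_id, "prompt": prompt, "answer": text})
            key[answer_id] = label
    return sheet, key

def mean_scores(ratings, key):
    """Average 1-5 ratings per quant label; `ratings` maps answer id -> score."""
    by_label = {}
    for answer_id, score in ratings.items():
        by_label.setdefault(key[answer_id], []).append(score)
    return {label: sum(s) / len(s) for label, s in by_label.items()}

# Stub generators stand in for real model calls (assumption for illustration).
stubs = {q: (lambda p, q=q: f"[{q}] answer to: {p}") for q in ("Q4_K_M", "Q5_K_M", "Q8_0")}
sheet, key = build_blind_sheet(["Summarize clause 4.2", "Explain this error log"], stubs)
print(len(sheet), "answers to rate")
```

Rate each row on coherence, factual accuracy, and language quality, then pass the scores to `mean_scores` to see whether the gap between quants is visible on your own prompts.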
Quantization · Llama · Optimization