Quantization · 8 min read

Q4 vs Q5 vs Q8: which quantization for Llama 70B in 2026?

Damien · LocalIA
Published 2026-05-12

VRAM table per quant (Q3 to FP16), measured quality loss, GPU recommendations and estimated tok/s per setup.


Translated article. This version has been localized so that international pages do not show French article text. Technical data, prices and recommendations are unchanged.

The table that settles it

Quant     VRAM       Perplexity vs FP16   Perceived quality
Q3_K_M    ~32 GB     +5-8%                Noticeable: grammar OK, reasoning degraded
Q4_K_M    ~44 GB     +2-3%                Good: consumer sweet spot
Q5_K_M    ~52 GB     +0.8-1.2%            Very good: indistinguishable in chat
Q6_K      ~60 GB     +0.3-0.5%            Near-perfect
Q8_0      ~78 GB     +0.05-0.1%           Indistinguishable from FP16
FP16      ~146 GB    0                    Reference, datacenter only
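
The VRAM figures above have to cover the KV cache as well as the weights, and the cache grows linearly with context length. As a rough sketch of that arithmetic, the function below assumes a Llama-3-style 70B architecture (80 layers, 8 key/value heads, head dimension 128) and an FP16 cache; none of these values come from the article, so treat them as assumptions to adjust for your model.

```python
# Rough KV-cache size estimate for a Llama-3-style 70B model.
# Assumed architecture (not stated in the article): 80 layers,
# 8 key/value heads (grouped-query attention), head dimension 128.

def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 80,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


if __name__ == "__main__":
    for ctx in (8_192, 32_768, 131_072):
        gb = kv_cache_bytes(ctx) / 1e9
        print(f"{ctx:>7} tokens -> ~{gb:.1f} GB of FP16 KV cache")
```

Under these assumptions, a 32,768-token context costs about 10.7 GB per stream, which is worth adding to the table's figure before deciding that a quant fits.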

Which quant for which GPU?

  • 24 GB (RTX 4090): Llama 70B does not fit on a single GPU at any reasonable quant. Pick a 27-32B model instead.
  • 32 GB (RTX 5090): Llama 70B Q3_K_M only just fits (~30-32 GB), leaving very little room for context. Quality is noticeably degraded, mainly in reasoning.
  • 48 GB (A6000, 6000 Ada): Llama 70B Q4_K_M fits with margin. Realistic minimum for single-GPU 70B.
  • 64 GB (2× 5090): Llama 70B Q5_K_M or Q6_K, near-FP16 quality.
  • 96 GB+ (2× A6000 NVLink, H100, MI300X): Q8_0 for production-critical workloads, or FP16 where total memory exceeds the ~146 GB it needs.
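
The list above amounts to a simple lookup: take the highest-quality quant whose footprint fits the available VRAM. A minimal sketch of that rule follows, using the approximate VRAM figures from the table; the function name `best_quant` and the `headroom_gb` parameter are hypothetical, added only for illustration.

```python
from typing import Optional

# (quant, approximate VRAM in GB) taken from the article's table,
# ordered from best quality to worst.
QUANTS = [
    ("FP16", 146),
    ("Q8_0", 78),
    ("Q6_K", 60),
    ("Q5_K_M", 52),
    ("Q4_K_M", 44),
    ("Q3_K_M", 32),
]


def best_quant(total_vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-quality quant that fits, or None if nothing fits.

    `headroom_gb` is VRAM you want to keep free (longer context, other
    processes); it is an illustrative knob, not something the article defines.
    """
    budget = total_vram_gb - headroom_gb
    for name, need_gb in QUANTS:
        if need_gb <= budget:
            return name
    return None


if __name__ == "__main__":
    for vram in (24, 32, 48, 64, 96):
        print(f"{vram} GB -> {best_quant(vram)}")
```

With no headroom this reproduces the recommendations in the list; asking for a few gigabytes of headroom drops the 32 GB case to "nothing fits", which is why that tier is marginal.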

Speed tradeoff (single user on A6000)

Quant     Speed           Note
Q3_K_M    ~17-19 tok/s    Fewer bytes to stream
Q4_K_M    ~14-16 tok/s    Speed/quality sweet spot
Q5_K_M    ~12-14 tok/s    About 18% more bytes than Q4 (52 vs 44 GB)
Q8_0      ~9-11 tok/s     About 1.8× the bytes of Q4 (78 vs 44 GB)
FP16      ~5-6 tok/s      About 3.3× the bytes of Q4 (146 vs 44 GB)

Default 2026 combo: Llama 3.3 70B Q5_K_M on a 2× RTX 5090 rig (64 GB VRAM). Quality indistinguishable from FP16 in chat/RAG, 12-15 tok/s per stream, 32k context usable without saturating VRAM.
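
The speed ordering in the table follows from single-user decoding being largely memory-bound: each generated token streams roughly the whole model through the GPU once, so tokens per second is capped near memory bandwidth divided by model size. The sketch below illustrates that ceiling. The 768 GB/s bandwidth is the RTX A6000 spec-sheet value and the `efficiency` factor is a hypothetical tuning knob; neither comes from the article, and real throughput depends on the runtime and on how the model is split across devices.

```python
def decode_tok_per_s(model_gb: float,
                     bandwidth_gb_s: float,
                     efficiency: float = 1.0) -> float:
    """Rough ceiling on single-stream decode speed for a memory-bound model.

    model_gb:       bytes read per generated token, approximated by model size
    bandwidth_gb_s: GPU memory bandwidth in GB/s
    efficiency:     fraction of peak bandwidth actually achieved (assumption)
    """
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return efficiency * bandwidth_gb_s / model_gb


if __name__ == "__main__":
    A6000_BANDWIDTH = 768.0  # GB/s, assumed from the RTX A6000 spec sheet
    sizes_gb = {"Q3_K_M": 32, "Q4_K_M": 44, "Q5_K_M": 52, "Q8_0": 78, "FP16": 146}
    for quant, size in sizes_gb.items():
        est = decode_tok_per_s(size, A6000_BANDWIDTH)
        print(f"{quant}: ~{est:.1f} tok/s ceiling")
```

The estimate is only an order-of-magnitude check: it lands in the same range as the table and reproduces the ranking, but it should not replace a measurement on your own hardware.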


Frequently asked questions

Which quantization for Llama 3.3 70B locally?
Q5_K_M is the production sweet spot: 52 GB VRAM, quality loss below 1.5% (perplexity), indistinguishable from FP16 in chat. Q4_K_M (44 GB) if VRAM is limited. Q8 (78 GB) only for critical production.
What is the quality difference between Q4 and Q8?
Q4_K_M: +2-3% perplexity vs FP16, noticeable in complex reasoning. Q5_K_M: +0.8-1.2%, indistinguishable in everyday chat. Q8: +0.05-0.1%, indistinguishable in benchmarks. For 95% of business cases, Q5_K_M is plenty.
Q4_K_M or Q4_K_S, which variant?
K_M (Medium) by default: a quality/size balance. K_S (Small): -5% VRAM, -1% quality (aggressive). K_L (Large): +5% VRAM, +0.5% quality (cautious). For the vast majority, K_M is the right pick.
Is FP16 really worth it versus Q5_K_M?
No, not for local production. FP16 needs 146 GB of VRAM for 70B (datacenter only). To recover roughly 1% of perplexity you nearly triple the VRAM compared with Q5_K_M (52 GB). Q5_K_M on 2× RTX 5090 (64 GB) covers 99% of real needs.
How do you test a quantization's quality on your use case?
Compare Q4_K_M, Q5_K_M and Q8 on 20-30 prompts representative of your field. Criteria: coherence, factual accuracy, language quality. The perceived difference between Q5 and Q8 is usually small.
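
One way to run that comparison without bias is to score the answers blind. The sketch below is a minimal harness under stated assumptions: each quant is wrapped in a plain Python callable that maps a prompt to a completion (how you call your runtime is up to you), and the names `build_blind_sheet` and `tally` are hypothetical helpers, not part of any library.

```python
import random
from typing import Callable, Dict, List

Generator = Callable[[str], str]  # prompt -> completion


def build_blind_sheet(prompts: List[str],
                      models: Dict[str, Generator],
                      seed: int = 0) -> List[dict]:
    """Run every prompt through every quant and shuffle the answers per
    prompt, so a reviewer can score them without knowing which is which."""
    rng = random.Random(seed)
    sheet = []
    for prompt in prompts:
        answers = [(name, generate(prompt)) for name, generate in models.items()]
        rng.shuffle(answers)
        sheet.append({
            "prompt": prompt,
            "answers": [text for _, text in answers],  # shown to the reviewer
            "key": [name for name, _ in answers],      # hidden until scoring is done
        })
    return sheet


def tally(sheet: List[dict], scores: List[List[float]]) -> Dict[str, float]:
    """scores[i][j] is the reviewer's score for answer j of prompt i.
    Returns the mean score per quant."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for row, row_scores in zip(sheet, scores):
        for name, score in zip(row["key"], row_scores):
            totals[name] = totals.get(name, 0.0) + score
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


if __name__ == "__main__":
    # Stub generators standing in for real Q4_K_M / Q8_0 backends.
    stubs = {"Q4_K_M": lambda p: f"[q4] {p}", "Q8_0": lambda p: f"[q8] {p}"}
    sheet = build_blind_sheet(["Summarize clause 4.", "Explain VAT rules."], stubs)
    for row in sheet:
        print(row["prompt"], "->", row["answers"])
```

Score each answer on the criteria above (coherence, factual accuracy, language quality), then pass the scores to `tally` to see whether the gap between quants is large enough to matter for your workload.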