Quantization · 8 min read

Q4 vs Q5 vs Q8: which quantization for Llama 70B in 2026?

Damien · LocalIA
Published 2026-05-12

VRAM table per quant (Q3 to FP16), measured quality loss, GPU recommendations, and estimated tok/s per setup.

[Image: LocalIA AI rig]

Translated article. This version is localized so that international pages do not show French article text. Technical data, prices, and recommendations are unchanged.

The table that settles it

| Quant | VRAM | Perplexity loss | Quality |
|---|---|---|---|
| Q3_K_M | ~32 GB | +5-8% | Noticeable: grammar OK, reasoning degraded |
| Q4_K_M | ~44 GB | +2-3% | Good: consumer sweet spot |
| Q5_K_M | ~52 GB | +0.8-1.2% | Very good: indistinguishable in chat |
| Q6_K | ~60 GB | +0.3-0.5% | Near-perfect |
| Q8_0 | ~78 GB | +0.05-0.1% | Indistinguishable from FP16 |
| FP16 | ~146 GB | 0 | Reference, datacenter only |
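Where do these VRAM numbers come from? Weight size scales linearly with the average bits per weight of each quant. The sketch below estimates the weights-only footprint for a 70B model; the bits-per-weight averages for llama.cpp K-quants are approximations I'm assuming here, and the table's figures additionally include KV cache and runtime overhead, so expect the real footprint to be a few GB higher.

```python
# Rough weights-only VRAM estimate for GGUF quants of a 70B model.
# Bits-per-weight values are approximate averages for llama.cpp
# K-quants (assumption); KV cache and overhead are NOT included.
BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def weights_gb(params_b: float, quant: str) -> float:
    """Approximate weight size in GB for `params_b` billion parameters."""
    bits = BITS_PER_WEIGHT[quant]
    return params_b * 1e9 * bits / 8 / 1e9  # bits -> bytes -> GB

if __name__ == "__main__":
    for q in BITS_PER_WEIGHT:
        print(f"{q:7s} ~{weights_gb(70, q):5.1f} GB (weights only)")
```

For Q4_K_M this lands around 42 GB of weights, consistent with the ~44 GB total in the table once cache and overhead are added.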

Which quant for which GPU?

  • 24 GB (RTX 4090): Llama 70B does not fit on a single GPU at any reasonable quant. Pick a 27-32B model instead.
  • 32 GB (RTX 5090): Llama 70B Q3_K_M just fits (~30 GB). Quality slightly degraded.
  • 48 GB (A6000, 6000 Ada): Llama 70B Q4_K_M fits with margin. Realistic minimum for single-GPU 70B.
  • 64 GB (2× 5090): Llama 70B Q5_K_M or Q6_K, near-FP16 quality.
  • 96 GB+ (2× A6000 NVLink, H100, MI300X): Q8 or FP16 for production-critical workloads.

Speed tradeoff (single user on A6000)

| Quant | Speed | Why |
|---|---|---|
| Q3_K_M | ~17-19 tok/s | Fewer bytes to stream |
| Q4_K_M | ~14-16 tok/s | Speed/quality sweet spot |
| Q5_K_M | ~12-14 tok/s | +10% bytes vs Q4 |
| Q8_0 | ~9-11 tok/s | 2× the weight |
| FP16 | ~5-6 tok/s | 4× the weight |
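The pattern in this table is no accident: single-user decoding is memory-bandwidth-bound, since every generated token streams the full weight set once. A back-of-the-envelope sketch, where the 0.8 efficiency factor is my assumption (real throughput depends on kernel quality, context length, and batch size):

```python
def decode_tok_s(weights_gb: float, bandwidth_gb_s: float,
                 efficiency: float = 0.8) -> float:
    """Bandwidth-bound decode estimate: each token reads all weights
    once, so tok/s ~= efficiency * bandwidth / model size.
    `efficiency` (fraction of peak bandwidth achieved) is an assumption."""
    return efficiency * bandwidth_gb_s / weights_gb

# A6000 has ~768 GB/s of memory bandwidth; Q4_K_M 70B is ~44 GB:
estimate = decode_tok_s(44, 768)  # lands in the ~14 tok/s range
```

This is why halving the bytes (Q8 to Q4) roughly doubles tok/s, and why FP16 at ~146 GB is so slow even when it fits.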
Default 2026 combo: Llama 3.3 70B Q5_K_M on a 2× RTX 5090 rig (64 GB VRAM). Quality indistinguishable from FP16 in chat/RAG, 12-15 tok/s per stream, 32k context usable without saturating VRAM.

Open the calculator / request a quote with your target model, user count, and constraints.

Quantization · Llama · Optimization