How much VRAM do you need to run a local LLM? (formula + 2026 table)
The exact method for estimating an LLM's VRAM needs: model weights x bytes per parameter, KV cache, overhead. A ready-made table (7B to 123B x Q4/Q5/Q8) and the minimum card per model.

Translated article. This version is localized so that international pages do not show French article text. Technical data, prices and recommendations remain the same.
The formula
VRAM is approximately (parameters x bytes per parameter) + KV cache + overhead. The first term is by far the largest, and it is driven by quantization: the fewer bytes per parameter, the smaller the model in memory.
| Format | Bytes per parameter | Notes |
| --- | --- | --- |
| FP16 (full) | 2 bytes/param | Reference quality, twice the size of Q8. |
| Q8 | ~1 byte/param | Near-FP16 quality. |
| Q5_K_M | ~0.65 bytes/param | Best quality/size trade-off for RAG. |
| Q4_K_M | ~0.5 bytes/param | Smallest sane size, slight quality loss. |
Weights: the dominant term
Multiply the parameter count by the bytes per parameter. A 70B model in Q4 is about 70 x 0.5 = 35 GB just for the weights; in Q5 about 45 GB; in Q8 about 70 GB; in FP16 about 140 GB.
- 7-8B in Q4: ~4-5 GB, fits any modern GPU.
- 32B in Q4: ~18-20 GB, fits a 24 GB card.
- 70B in Q4: ~35-40 GB, needs 48 GB or two GPUs.
- 123B in Q4: ~62-70 GB, needs two pro GPUs.
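The weight arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a measurement tool: the bytes-per-parameter values are the approximate ones from the quantization table, and the names `BYTES_PER_PARAM` and `weights_gb` are made up for this example. It counts weights only, so the figures sit at the low end of the ranges in the list.

```python
# Approximate bytes per parameter, copied from the quantization table.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8": 1.0,
    "Q5_K_M": 0.65,
    "Q4_K_M": 0.5,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billion * BYTES_PER_PARAM[quant]


if __name__ == "__main__":
    for quant in BYTES_PER_PARAM:
        print(f"70B in {quant}: ~{weights_gb(70, quant):.1f} GB")
```

For a 70B model this gives 140 GB in FP16, 70 GB in Q8, 45.5 GB in Q5_K_M and 35 GB in Q4_K_M, before any KV cache or overhead.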
KV cache and overhead
The KV cache stores attention keys and values for every token in the context window. It grows with context length and model depth, and can add several GB on long contexts. A fixed overhead of roughly 1-2 GB covers the CUDA context, activations and the framework.
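To make "grows with context length and model depth" concrete, here is a rough sketch of the usual KV cache size estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The function name and the example model shape (80 layers, 8 KV heads, head dimension 128, 16-bit cache) are assumptions chosen for illustration, not figures from this article; check the configuration of your own model.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: float = 2.0,  # assumption: cache stored in FP16
) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes) for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    )
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical 70B-class shape with grouped-query attention.
    for ctx in (8_192, 32_768):
        size = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_tokens=ctx)
        print(f"{ctx} tokens: ~{size:.2f} GB")
```

Under these assumptions the cache is about 2.7 GB at 8,192 tokens and about 10.7 GB at 32,768 tokens, which shows why long contexts need extra headroom. Serving several users at once multiplies the figure by the number of concurrent sequences.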
Rule of thumb: keep 15-20% of headroom above the raw weight size for the cache and overhead, more if you use long contexts.
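The rule of thumb can be turned into a quick fit check. This is a sketch under the assumption that a flat 20% margin is enough; the helper names `required_vram_gb` and `fits` are hypothetical, and long contexts will need a larger `headroom` value.

```python
def required_vram_gb(weight_size_gb: float, headroom: float = 0.20) -> float:
    """Weight size plus a proportional margin for KV cache and overhead."""
    return weight_size_gb * (1.0 + headroom)


def fits(weight_size_gb: float, card_gb: float, headroom: float = 0.20) -> bool:
    """True if the weights plus headroom fit in the card's VRAM."""
    return required_vram_gb(weight_size_gb, headroom) <= card_gb


if __name__ == "__main__":
    # 32B in Q4 is about 16 GB of weights, 70B in Q4 about 35 GB.
    print(fits(16, card_gb=24))  # 16 GB * 1.2 = 19.2 GB
    print(fits(35, card_gb=24))  # 35 GB * 1.2 = 42 GB
    print(fits(35, card_gb=48))
```

With a 20% margin, a 32B model in Q4 needs about 19 GB and fits a 24 GB card, while a 70B model in Q4 needs about 42 GB and only fits once 48 GB is available.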
VRAM by model and quantization
| Model | Smaller quantization | Larger quantization | Minimum hardware |
| --- | --- | --- | --- |
| Mistral 7B / Llama 8B | Q4 ~5 GB | Q8 ~8 GB | Any 12 GB GPU |
| Qwen 32B | Q4 ~19 GB | Q5 ~24 GB | RTX 4090 / 5090 |
| Llama 3.3 70B | Q4 ~40 GB | Q5 ~48 GB | 2 GPUs or 1 pro card |
| Mistral Large 123B | Q4 ~68 GB | Q5 ~84 GB | 2x A6000 NVLink |
Rather than guessing, drop your target model into the LocalIA GPU to LLM calculator: it returns the exact VRAM, the quantization that fits and the matching GPUs across 200+ cards.
Open the calculator, or ask us for advice with your target model, number of users and constraints.