VRAM · 8 min de lectura

¿Cuánta VRAM para ejecutar un LLM en local? (fórmula + tabla 2026)

DO
Damien · LocalIA
Publicado 2026-06-05

El método exacto para estimar la VRAM de un LLM: peso del modelo x bytes por parámetro, caché KV, overhead. Tabla lista para usar (7B a 123B x Q4/Q5/Q8) y la tarjeta mínima por modelo.

LocalIA AI rig

Articulo traducido. Esta version esta localizada para evitar mezclar interfaces internacionales con texto frances. Los datos tecnicos, importes y recomendaciones se mantienen iguales.

The formula

VRAM is approximately (parameters x bytes per parameter) + KV cache + overhead. The first term is by far the largest, and it is driven by quantization: the fewer bytes per parameter, the smaller the model in memory.

FP16 (full)2 bytes/paramReference quality, twice the size of Q8.
Q8~1 byte/paramNear-FP16 quality.
Q5_K_M~0.65 byte/paramBest quality/size trade-off for RAG.
Q4_K_M~0.5 byte/paramSmallest sane size, slight quality loss.

Weights: the dominant term

Multiply the parameter count by the bytes per parameter. A 70B model in Q4 is about 70 x 0.5 = 35 GB just for the weights; in Q5 about 45 GB; in Q8 about 70 GB; in FP16 about 140 GB.

  • 7-8B in Q4: ~4-5 GB, fits any modern GPU.
  • 32B in Q4: ~18-20 GB, fits a 24 GB card.
  • 70B in Q4: ~35-40 GB, needs 48 GB or two GPUs.
  • 123B in Q4: ~62-70 GB, needs two pro GPUs.

KV cache and overhead

The KV cache stores attention keys and values for every token in the context window. It grows with context length and model depth, and can add several GB on long contexts. A fixed overhead of roughly 1-2 GB covers the CUDA context, activations and the framework.

Rule of thumb: keep 15-20% of headroom above the raw weight size for the cache and overhead, more if you use long contexts.

VRAM by model and quantization

Mistral 7B / Llama 8BQ4 ~5 GBQ8 ~8 GBAny 12 GB GPU
Qwen 32BQ4 ~19 GBQ5 ~24 GBRTX 4090 / 5090
Llama 3.3 70BQ4 ~40 GBQ5 ~48 GB2 GPUs or 1 pro card
Mistral Large 123BQ4 ~68 GBQ5 ~84 GB2x A6000 NVLink
Rather than guessing, drop your target model into the LocalIA GPU to LLM calculator: it returns the exact VRAM, the quantization that fits and the matching GPUs across 200+ cards.

Abre la calculadora / escríbenos para un consejo con tu modelo objetivo, usuarios y restricciones.

Preguntas frecuentes

How much VRAM do you need for a 70B LLM locally?+
Around 40 GB in Q4, 50 GB in Q5 and 78 GB in Q8 (weights + ~12% overhead). A Llama 3.3 70B or Qwen 2.5 72B in Q4/Q5 fits on 2x RTX 5090 (64 GB combined VRAM).
How do you calculate the VRAM a model needs?+
VRAM is approximately (billions of parameters x bytes per parameter + KV cache) x 1.12. Bytes per parameter: 2 in FP16, 1 in Q8, 0.625 in Q5, 0.5 in Q4. The KV cache grows with context length.
How much VRAM for a small 7B or 8B model?+
About 5 to 6 GB in Q4/Q5. A Mistral 7B or Llama 3.1 8B fits comfortably on an entry-level card like an RTX 3060 12 GB, or even an 8 GB card in Q4.
How much VRAM does the KV cache use?+
As an order of magnitude, 0.1 to 0.5 GB per 1,000 tokens of context for a 70B model (less for smaller ones). On long prompts (RAG), keep 10-20% headroom just for the KV cache.
How do you reduce the VRAM needed to run an LLM?+
Three levers: drop one quantization step (Q8 to Q4 almost halves the weight VRAM), limit the context window, or pick a MoE model that activates only a fraction of its parameters per token.
VRAMGPUGuía