VRAM · 8 Min. Lesezeit

Wie viel VRAM für ein lokales LLM? (Formel + Tabelle 2026)

Damien · LocalIA

Veröffentlicht 2026-06-05

Die exakte Methode zur Schätzung des VRAM eines LLM: Modellgewichte x Bytes pro Parameter, KV-Cache, Overhead. Eine sofort nutzbare Tabelle (7B bis 123B x Q4/Q5/Q8) und die Mindest-Karte je Modell.

Uebersetzter Artikel. Diese Version ist lokalisiert, damit internationale Seiten keinen franzoesischen Artikeltext anzeigen. Technische Daten, Preise und Empfehlungen bleiben gleich.

The formula

VRAM is approximately (parameters x bytes per parameter) + KV cache + overhead. The first term is by far the largest, and it is driven by quantization: the fewer bytes per parameter, the smaller the model in memory.

FP16 (full)	2 bytes/param	Reference quality, twice the size of Q8.
Q8	~1 byte/param	Near-FP16 quality.
Q5_K_M	~0.65 byte/param	Best quality/size trade-off for RAG.
Q4_K_M	~0.5 byte/param	Smallest sane size, slight quality loss.

Weights: the dominant term

Multiply the parameter count by the bytes per parameter. A 70B model in Q4 is about 70 x 0.5 = 35 GB just for the weights; in Q5 about 45 GB; in Q8 about 70 GB; in FP16 about 140 GB.

7-8B in Q4: ~4-5 GB, fits any modern GPU.
32B in Q4: ~18-20 GB, fits a 24 GB card.
70B in Q4: ~35-40 GB, needs 48 GB or two GPUs.
123B in Q4: ~62-70 GB, needs two pro GPUs.

KV cache and overhead

The KV cache stores attention keys and values for every token in the context window. It grows with context length and model depth, and can add several GB on long contexts. A fixed overhead of roughly 1-2 GB covers the CUDA context, activations and the framework.

Rule of thumb: keep 15-20% of headroom above the raw weight size for the cache and overhead, more if you use long contexts.

VRAM by model and quantization

Mistral 7B / Llama 8B	Q4 ~5 GB	Q8 ~8 GB	Any 12 GB GPU
Qwen 32B	Q4 ~19 GB	Q5 ~24 GB	RTX 4090 / 5090
Llama 3.3 70B	Q4 ~40 GB	Q5 ~48 GB	2 GPUs or 1 pro card
Mistral Large 123B	Q4 ~68 GB	Q5 ~84 GB	2x A6000 NVLink

Rather than guessing, drop your target model into the LocalIA GPU to LLM calculator: it returns the exact VRAM, the quantization that fits and the matching GPUs across 200+ cards.

Rechner öffnen / frag uns um Rat mit Zielmodell, Nutzern und Randbedingungen.

Häufig gestellte Fragen

Wie viel VRAM braucht man für ein 70B-LLM lokal?+

Etwa 40 GB in Q4, 50 GB in Q5 und 78 GB in Q8 (Gewichte + ~12 % Overhead). Ein Llama 3.3 70B oder Qwen 2.5 72B in Q4/Q5 passt auf 2x RTX 5090 (64 GB kombinierte VRAM).

Wie berechnet man die benötigte VRAM eines Modells?+

VRAM ist ungefähr (Milliarden Parameter x Bytes pro Parameter + KV-Cache) x 1,12. Bytes pro Parameter: 2 in FP16, 1 in Q8, 0,625 in Q5, 0,5 in Q4. Der KV-Cache wächst mit der Kontextlänge.

Wie viel VRAM für ein kleines 7B- oder 8B-Modell?+

Etwa 5 bis 6 GB in Q4/Q5. Ein Mistral 7B oder Llama 3.1 8B passt bequem auf eine Einsteigerkarte wie eine RTX 3060 12 GB, oder sogar eine 8-GB-Karte in Q4.

Wie viel VRAM verbraucht der KV-Cache?+

Größenordnung: 0,1 bis 0,5 GB pro 1.000 Kontext-Tokens für ein 70B-Modell (weniger bei kleineren). Bei langen Prompts (RAG) sollten Sie 10-20 % Reserve allein für den KV-Cache lassen.

Wie reduziert man die für ein LLM benötigte VRAM?+

Drei Hebel: eine Quantization-Stufe runter (Q8 zu Q4 halbiert fast die Gewichts-VRAM), das Kontextfenster begrenzen, oder ein MoE-Modell wählen, das nur einen Bruchteil seiner Parameter pro Token aktiviert.

VRAMGPULeitfaden

X Reddit LinkedIn