GPU · 9 min read

Which GPU do you need to run Llama 3.3 70B locally in 2026?

Damien · LocalIA
Published 2026-05-07

VRAM per quantization level, compatible GPUs, RTX 5090 vs A6000 vs H100, and a cost/performance comparison against the OpenAI API.


Llama 3.3 70B is a reference model for local RAG and agents. Running it takes enough VRAM and a well-chosen quantization.

VRAM by quantization

Quantization   VRAM      Notes
Q4_K_M         ~47 GB    Acceptable quality; does not fit on a single consumer GPU.
Q5_K_M         ~58 GB    Very good quality; recommended for RAG.
Q8             ~84 GB    Near-FP16 quality.
FP16           ~168 GB   Reference precision; datacenter-class hardware.

Typical hardware

  • 24-32 GB: better suited to smaller models, or to CPU offload.
  • 48-64 GB: the 2026 sweet spot, especially 2x RTX 5090 for Q5.
  • 80+ GB: Q8 and large MoE models become realistic.
Before buying, test the model in the LocalIA calculator and check your VRAM headroom.
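
As a rough illustration of the headroom check, the sketch below compares a rig's total VRAM against the approximate footprints from the table above. The function names and the idea of simply summing VRAM across cards are assumptions made for illustration (they presume the model can be split across GPUs); this is not how the LocalIA calculator is implemented.

```python
# Approximate VRAM footprints (GB) for Llama 3.3 70B, copied from the table above.
FOOTPRINT_GB = {"Q4_K_M": 47, "Q5_K_M": 58, "Q8": 84, "FP16": 168}


def vram_headroom(total_vram_gb: float) -> dict[str, float]:
    """Spare VRAM in GB per quantization; a negative value means it does not fit."""
    return {quant: total_vram_gb - need for quant, need in FOOTPRINT_GB.items()}


def fitting_quants(total_vram_gb: float, min_headroom_gb: float = 0.0) -> list[str]:
    """Quantizations that fit while leaving at least `min_headroom_gb` spare."""
    headroom = vram_headroom(total_vram_gb)
    return [quant for quant, spare in headroom.items() if spare >= min_headroom_gb]


if __name__ == "__main__":
    # Hypothetical rigs; total VRAM is summed across cards (a simplification).
    for label, vram in [("1x RTX 5090", 32), ("2x RTX 5090", 64), ("2x A6000", 96)]:
        print(f"{label} ({vram} GB): {fitting_quants(vram) or 'offload needed'}")
```

Raising `min_headroom_gb` is one way to reserve space for longer contexts or a second, smaller model on the same cards.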

Open the calculator, or ask us for advice with your target model, number of users, and constraints.

Frequently asked questions

What is the minimum GPU to run Llama 3.3 70B locally?
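An RTX 5090 (32 GB) in Q3_K_M works with quality trade-offs. For comfortable Q5_K_M: 2x RTX 5090 (64 GB); a single A6000 (48 GB) fits Q4_K_M but not Q5_K_M without offload. For Q8 production: 2x A6000 NVLink (96 GB); a single H100 80 GB sits just under the ~84 GB estimate, so it leaves little room for context.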
How much VRAM for Llama 70B in Q4?
Around 47 GB in Q4_K_M: 70B parameters × 0.5625 bytes ≈ 39 GB of weights, plus 20% overhead for KV cache and context. That leaves almost no headroom on a 48 GB card (A6000); 64 GB (2x RTX 5090) is more comfortable. The 32 GB of a single 5090 is not enough for Q4 without offload.
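
The rule of thumb in this answer (parameters × bytes per weight, plus 20% overhead) can be sketched in a few lines. The function name and the use of 1 GB = 10^9 bytes are assumptions for illustration; only the 0.5625 bytes/weight figure for Q4_K_M and the 20% overhead come from the text, while 2 bytes/weight for FP16 follows from the format itself. With 70 billion parameters the rule gives about 47 GB for Q4_K_M and 168 GB for FP16, in line with the table above.

```python
def estimate_vram_gb(params_billion: float, bytes_per_weight: float,
                     overhead: float = 0.20) -> float:
    """Rule-of-thumb VRAM estimate in GB (1 GB = 1e9 bytes).

    weights = parameters x bytes per weight; `overhead` is the extra fraction
    reserved for KV cache and context (20% in the text above).
    """
    weights_gb = params_billion * bytes_per_weight  # billions of params x bytes = GB
    return weights_gb * (1.0 + overhead)


if __name__ == "__main__":
    q4 = estimate_vram_gb(70, 0.5625)   # Q4_K_M, ~4.5 bits per weight
    fp16 = estimate_vram_gb(70, 2.0)    # FP16, 16 bits per weight
    print(f"Q4_K_M: ~{q4:.0f} GB, FP16: ~{fp16:.0f} GB")
```

The overhead term is the least certain part: KV cache grows with context length and concurrent users, so treat 20% as a starting point and not as a guarantee.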
What are the differences between Q3, Q4, Q5 and Q8 on Llama 70B?
Q3: ~32 GB VRAM, degraded reasoning quality. Q4: ~47 GB, the consumer sweet spot. Q5: ~58 GB, indistinguishable from FP16 in chat. Q8: ~84 GB, near-FP16 for production. For most uses, Q5_K_M is the best quality/VRAM trade-off.
RTX 4090 or RTX 5090 for Llama 70B in 2026?
The RTX 5090 wins on VRAM capacity (32 GB vs 24 GB, +33%), memory bandwidth (~1,800 GB/s vs ~1,000 GB/s) and FP4 support. In mid-2026 both sit at similar prices due to the AI shortage, so prefer the 5090 to be future-proof.
Is a Pro build (~EUR 11,990) well sized for Llama 70B?
Yes, it is the 2026 sweet spot. 2x RTX 5090 = 64 GB total VRAM via vLLM tensor parallelism. Llama 3.3 70B Q5_K_M runs at 28-35 tok/s single-user and reaches 90-100 tok/s combined in batching for 5 users.
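
To put the batching figures in per-user terms, the sketch below divides the combined throughput quoted above by the number of concurrent users. It assumes throughput is shared evenly across users, which is a simplification; the function name is hypothetical.

```python
def per_user_tok_s(combined_tok_s: float, users: int) -> float:
    """Average tokens/s each user sees, assuming an even split of batched throughput."""
    if users <= 0:
        raise ValueError("users must be a positive integer")
    return combined_tok_s / users


if __name__ == "__main__":
    low = per_user_tok_s(90, 5)    # 18.0
    high = per_user_tok_s(100, 5)  # 20.0
    print(f"5 users: {low:.0f}-{high:.0f} tok/s each, vs 28-35 tok/s single-user")
```

Under that assumption, each of the 5 users sees roughly 18-20 tok/s, below the 28-35 tok/s single-user figure, while the rig as a whole serves about three times as many tokens per second.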