RAG sovrano con Qwen 3 30B MoE: lo stack completo 2026
Perche Qwen 3 30B-A3B (MoE, 3B param attivi/token) e lo sweet spot 2026 per un RAG sovrano di team. Stack: vLLM + Qdrant + nomic-embed + LlamaIndex su un rig Pro (11.990 EUR). Tutto open-weight, tutto self-hosted.

Articolo tradotto. Questa versione e localizzata per evitare pagine internazionali con testo francese. Dati tecnici, prezzi e raccomandazioni restano invariati.
Why a MoE is perfect for RAG
A Mixture of Experts like Qwen 3 30B-A3B contains 128 specialized experts. At each token, a router activates 8 of them. Memory needs the full 30B in VRAM, but compute only touches 3B active params per token — 10× faster than a dense 30B. For RAG, where most queries are answer using this context, MoE experts in summarization/extraction get activated while the rest stay dormant. You get impressive server throughput without sacrificing quality.
VRAM requirements
| FP16 | ~62 GB | 1× A6000 (48 GB) + offload, or 2× RTX 5090 |
| Q8_0 | ~33 GB | 1× RTX 5090 (32 GB) — tight |
| Q5_K_M | ~22 GB | 1× RTX 4090 (24 GB) — comfortable |
| Q4_K_M | ~18 GB | 1× RTX 5090 32 GB — large margin |
The complete RAG stack
- Generation: Qwen 3 30B-A3B via vLLM with --enable-prefix-caching (critical in RAG: retrieved chunks are reused).
- Embeddings: nomic-embed v2 (open-weight, 768 dim, multilingual, ~5,000 emb/sec on RTX 5090).
- Vector DB: Qdrant (Rust, Apache 2.0, self-hosted in 1 Docker command, <20ms latency for 100k docs).
- Orchestration: LlamaIndex (data-oriented), LangChain (ops-oriented) or Haystack 2.0 (EU production).
- Reranking: bge-reranker-v2-m3 (open-weight, ~600 MB) for 100% sovereignty, or Cohere Rerank 3 if you accept an external API.
Expected performance on LocalIA Pro rig
| Query latency (top-5 retrieval) | ~80 ms | Qdrant + embedding |
| Time to first token | ~250 ms cold · ~100 ms with prefix cache hit | vLLM path |
| Generation single user | 55-65 tok/s | Qwen 3 30B Q4_K_M |
| Throughput batching 5 users | ~280 tok/s | Prefix cache active |
| Electricity cost per 1M tokens | ~EUR 0.28 | 650 W average × EUR 0.25/kWh |
Qwen 3 30B-A3B + Qdrant + nomic-embed + vLLM is the most efficient sovereign RAG stack of 2026 for a 5-30 person team. It fits on a Pro rig (EUR 11,990 ex VAT), scales in batching, and every component is open-weight or open-source.
Apri il calcolatore / richiedi un preventivo con modello target, utenti e vincoli.