RAG · 9 min di lettura

RAG sovrano con Qwen 3 30B MoE: lo stack completo 2026

Q: Why is Qwen 3 30B MoE the RAG sweet spot in 2026?

30B total / 3B active per token = the quality of a 30B at the speed of a 3B. 50-65 tok/s single user, 280 tok/s in 5-user batching on a Pro build. Ideal for a team RAG of 5-30 people with an unbeatable quality/throughput ratio.

Q: How much VRAM for Qwen 3 30B MoE in production RAG?

22 GB in Q5_K_M (very comfortable on a single RTX 5090 32 GB), 18 GB in Q4_K_M. It lets you keep Qdrant + embeddings + RAG context on the same GPU without saturating it. For aggressive batching, take 2x RTX 5090.

Q: Full sovereign RAG stack with Qwen 3 MoE?

vLLM (serving) + Qwen 3 30B-A3B (LLM) + nomic-embed v2 (768d embeddings) + Qdrant (Rust vector DB) + LlamaIndex or Haystack (orchestration) + bge-reranker-v2-m3 (reranking). All open-weight, 100% self-hosted.

Q: Expected performance on a LocalIA Pro build?

Query latency (top-5 retrieval): ~80 ms. Time to first token: ~250 ms (100 ms on a prefix cache hit). Single-user generation: 55-65 tok/s. 5-user batching throughput: ~280 tok/s. Electricity cost: ~EUR 0.28 / 1M tokens.

Q: Qwen is Chinese, is that a sovereignty problem?

No if you control execution. The weights are open-weight (Apache 2.0). Once downloaded and running on your rig, the model sends nothing to anyone. Same logic as Linux (Linus Torvalds is Finnish): origin does not compromise sovereignty.

Damien · LocalIA

Pubblicato 2026-05-12

Perche Qwen 3 30B-A3B (MoE, 3B param attivi/token) e lo sweet spot 2026 per un RAG sovrano di team. Stack: vLLM + Qdrant + nomic-embed + LlamaIndex su un rig Pro (11.990 EUR). Tutto open-weight, tutto self-hosted.

Articolo tradotto. Questa versione e localizzata per evitare pagine internazionali con testo francese. Dati tecnici, prezzi e raccomandazioni restano invariati.

Why a MoE is perfect for RAG

A Mixture of Experts like Qwen 3 30B-A3B contains 128 specialized experts. At each token, a router activates 8 of them. Memory needs the full 30B in VRAM, but compute only touches 3B active params per token — 10× faster than a dense 30B. For RAG, where most queries are answer using this context, MoE experts in summarization/extraction get activated while the rest stay dormant. You get impressive server throughput without sacrificing quality.

VRAM requirements

FP16	~62 GB	1× A6000 (48 GB) + offload, or 2× RTX 5090
Q8_0	~33 GB	1× RTX 5090 (32 GB) — tight
Q5_K_M	~22 GB	1× RTX 4090 (24 GB) — comfortable
Q4_K_M	~18 GB	1× RTX 5090 32 GB — large margin

The complete RAG stack

Generation: Qwen 3 30B-A3B via vLLM with --enable-prefix-caching (critical in RAG: retrieved chunks are reused).
Embeddings: nomic-embed v2 (open-weight, 768 dim, multilingual, ~5,000 emb/sec on RTX 5090).
Vector DB: Qdrant (Rust, Apache 2.0, self-hosted in 1 Docker command, <20ms latency for 100k docs).
Orchestration: LlamaIndex (data-oriented), LangChain (ops-oriented) or Haystack 2.0 (EU production).
Reranking: bge-reranker-v2-m3 (open-weight, ~600 MB) for 100% sovereignty, or Cohere Rerank 3 if you accept an external API.

Expected performance on LocalIA Pro rig

Query latency (top-5 retrieval)	~80 ms	Qdrant + embedding
Time to first token	~250 ms cold · ~100 ms with prefix cache hit	vLLM path
Generation single user	55-65 tok/s	Qwen 3 30B Q4_K_M
Throughput batching 5 users	~280 tok/s	Prefix cache active
Electricity cost per 1M tokens	~EUR 0.28	650 W average × EUR 0.25/kWh

Qwen 3 30B-A3B + Qdrant + nomic-embed + vLLM is the most efficient sovereign RAG stack of 2026 for a 5-30 person team. It fits on a Pro rig (EUR 11,990 ex VAT), scales in batching, and every component is open-weight or open-source.

Apri il calcolatore / chiedici un consiglio con modello target, utenti e vincoli.

Domande frequenti

Why is Qwen 3 30B MoE the RAG sweet spot in 2026?+

30B total / 3B active per token = the quality of a 30B at the speed of a 3B. 50-65 tok/s single user, 280 tok/s in 5-user batching on a Pro build. Ideal for a team RAG of 5-30 people with an unbeatable quality/throughput ratio.

How much VRAM for Qwen 3 30B MoE in production RAG?+

22 GB in Q5_K_M (very comfortable on a single RTX 5090 32 GB), 18 GB in Q4_K_M. It lets you keep Qdrant + embeddings + RAG context on the same GPU without saturating it. For aggressive batching, take 2x RTX 5090.

Full sovereign RAG stack with Qwen 3 MoE?+

vLLM (serving) + Qwen 3 30B-A3B (LLM) + nomic-embed v2 (768d embeddings) + Qdrant (Rust vector DB) + LlamaIndex or Haystack (orchestration) + bge-reranker-v2-m3 (reranking). All open-weight, 100% self-hosted.

Expected performance on a LocalIA Pro build?+

Query latency (top-5 retrieval): ~80 ms. Time to first token: ~250 ms (100 ms on a prefix cache hit). Single-user generation: 55-65 tok/s. 5-user batching throughput: ~280 tok/s. Electricity cost: ~EUR 0.28 / 1M tokens.

Qwen is Chinese, is that a sovereignty problem?+

No if you control execution. The weights are open-weight (Apache 2.0). Once downloaded and running on your rig, the model sends nothing to anyone. Same logic as Linux (Linus Torvalds is Finnish): origin does not compromise sovereignty.

RAGQwenSovranitaMoE

X Reddit LinkedIn