RAG · 9 Min. Lesezeit

Souveraenes RAG mit Qwen 3 30B MoE: der vollstaendige 2026er Stack

DO
Damien · LocalIA
Veröffentlicht 2026-05-12

Warum Qwen 3 30B-A3B (MoE, 3B aktive Params/Token) der 2026er Sweet Spot fuer souveraenes Team-RAG ist. Stack: vLLM + Qdrant + nomic-embed + LlamaIndex auf einem Pro-Rig (11.990 EUR). Alles open-weight, alles self-hosted.

LocalIA AI rig

Uebersetzter Artikel. Diese Version ist lokalisiert, damit internationale Seiten keinen franzoesischen Artikeltext anzeigen. Technische Daten, Preise und Empfehlungen bleiben gleich.

Why a MoE is perfect for RAG

A Mixture of Experts like Qwen 3 30B-A3B contains 128 specialized experts. At each token, a router activates 8 of them. Memory needs the full 30B in VRAM, but compute only touches 3B active params per token — 10× faster than a dense 30B. For RAG, where most queries are answer using this context, MoE experts in summarization/extraction get activated while the rest stay dormant. You get impressive server throughput without sacrificing quality.

VRAM requirements

FP16~62 GB1× A6000 (48 GB) + offload, or 2× RTX 5090
Q8_0~33 GB1× RTX 5090 (32 GB) — tight
Q5_K_M~22 GB1× RTX 4090 (24 GB) — comfortable
Q4_K_M~18 GB1× RTX 5090 32 GB — large margin

The complete RAG stack

  • Generation: Qwen 3 30B-A3B via vLLM with --enable-prefix-caching (critical in RAG: retrieved chunks are reused).
  • Embeddings: nomic-embed v2 (open-weight, 768 dim, multilingual, ~5,000 emb/sec on RTX 5090).
  • Vector DB: Qdrant (Rust, Apache 2.0, self-hosted in 1 Docker command, <20ms latency for 100k docs).
  • Orchestration: LlamaIndex (data-oriented), LangChain (ops-oriented) or Haystack 2.0 (EU production).
  • Reranking: bge-reranker-v2-m3 (open-weight, ~600 MB) for 100% sovereignty, or Cohere Rerank 3 if you accept an external API.

Expected performance on LocalIA Pro rig

Query latency (top-5 retrieval)~80 msQdrant + embedding
Time to first token~250 ms cold · ~100 ms with prefix cache hitvLLM path
Generation single user55-65 tok/sQwen 3 30B Q4_K_M
Throughput batching 5 users~280 tok/sPrefix cache active
Electricity cost per 1M tokens~EUR 0.28650 W average × EUR 0.25/kWh
Qwen 3 30B-A3B + Qdrant + nomic-embed + vLLM is the most efficient sovereign RAG stack of 2026 for a 5-30 person team. It fits on a Pro rig (EUR 11,990 ex VAT), scales in batching, and every component is open-weight or open-source.

Rechner öffnen / frag uns um Rat mit Zielmodell, Nutzern und Randbedingungen.

Häufig gestellte Fragen

Why is Qwen 3 30B MoE the RAG sweet spot in 2026?+
30B total / 3B active per token = the quality of a 30B at the speed of a 3B. 50-65 tok/s single user, 280 tok/s in 5-user batching on a Pro build. Ideal for a team RAG of 5-30 people with an unbeatable quality/throughput ratio.
How much VRAM for Qwen 3 30B MoE in production RAG?+
22 GB in Q5_K_M (very comfortable on a single RTX 5090 32 GB), 18 GB in Q4_K_M. It lets you keep Qdrant + embeddings + RAG context on the same GPU without saturating it. For aggressive batching, take 2x RTX 5090.
Full sovereign RAG stack with Qwen 3 MoE?+
vLLM (serving) + Qwen 3 30B-A3B (LLM) + nomic-embed v2 (768d embeddings) + Qdrant (Rust vector DB) + LlamaIndex or Haystack (orchestration) + bge-reranker-v2-m3 (reranking). All open-weight, 100% self-hosted.
Expected performance on a LocalIA Pro build?+
Query latency (top-5 retrieval): ~80 ms. Time to first token: ~250 ms (100 ms on a prefix cache hit). Single-user generation: 55-65 tok/s. 5-user batching throughput: ~280 tok/s. Electricity cost: ~EUR 0.28 / 1M tokens.
Qwen is Chinese, is that a sovereignty problem?+
No if you control execution. The weights are open-weight (Apache 2.0). Once downloaded and running on your rig, the model sends nothing to anyone. Same logic as Linux (Linus Torvalds is Finnish): origin does not compromise sovereignty.
RAGQwenSouveraenitaetMoE