RAG · 9 min read

Sovereign RAG with Qwen 3 30B MoE: the complete 2026 stack

DO
Damien · LocalIA
Published 2026-05-12

Why Qwen 3 30B-A3B (MoE, 3B active params/token) is the 2026 sweet spot for a sovereign team RAG. Stack: vLLM + Qdrant + nomic-embed + LlamaIndex, on a Pro rig (EUR 11,990). All open-weight, fully self-hosted.

LocalIA AI rig

Qwen 3 30B-A3B is a Chinese open-weight MoE model released early 2026 with one decisive property for RAG: 30 billion total parameters but only 3 billion active per token. You get 30B quality at 3B speed. That makes it the 2026 sweet spot for sovereign team RAG on a modest rig.

Why a MoE is perfect for RAG

A Mixture of Experts like Qwen 3 30B-A3B contains 128 specialized experts. At each token, a router activates 8 of them. Memory needs the full 30B in VRAM, but compute only touches 3B active params per token — 10× faster than a dense 30B. For RAG, where most queries are answer using this context, MoE experts in summarization/extraction get activated while the rest stay dormant. You get impressive server throughput without sacrificing quality.

VRAM requirements

FP16~62 GB1× A6000 (48 GB) + offload, or 2× RTX 5090
Q8_0~33 GB1× RTX 5090 (32 GB) — tight
Q5_K_M~22 GB1× RTX 4090 (24 GB) — comfortable
Q4_K_M~18 GB1× RTX 5090 32 GB — large margin

The complete RAG stack

  • Generation: Qwen 3 30B-A3B via vLLM with --enable-prefix-caching (critical in RAG: retrieved chunks are reused).
  • Embeddings: nomic-embed v2 (open-weight, 768 dim, multilingual, ~5,000 emb/sec on RTX 5090).
  • Vector DB: Qdrant (Rust, Apache 2.0, self-hosted in 1 Docker command, <20ms latency for 100k docs).
  • Orchestration: LlamaIndex (data-oriented), LangChain (ops-oriented) or Haystack 2.0 (EU production).
  • Reranking: bge-reranker-v2-m3 (open-weight, ~600 MB) for 100% sovereignty, or Cohere Rerank 3 if you accept an external API.

Expected performance on LocalIA Pro rig

Query latency (top-5 retrieval)~80 msQdrant + embedding
Time to first token~250 ms cold · ~100 ms with prefix cache hitvLLM path
Generation single user55-65 tok/sQwen 3 30B Q4_K_M
Throughput batching 5 users~280 tok/sPrefix cache active
Electricity cost per 1M tokens~EUR 0.28650 W average × EUR 0.25/kWh
Qwen 3 30B-A3B + Qdrant + nomic-embed + vLLM is the most efficient sovereign RAG stack of 2026 for a 5-30 person team. It fits on a Pro rig (EUR 11,990 ex VAT), scales in batching, and every component is open-weight or open-source.

Open the calculator / request a quote with your target model, users and constraints.

RAGQwenSovereigntyMoE