RAG · 9 min di lettura

Costruire un RAG locale nel 2026: stack Ollama + Qdrant + LlamaIndex

DO
Damien · LocalIA
Pubblicato 2026-05-12

Architettura in 4 mattoni, scelte tecniche, sizing GPU per utenti concorrenti e TCO a 24 mesi vs GPT-4o.

LocalIA AI rig

Articolo tradotto. Questa versione e localizzata per evitare pagine internazionali con testo francese. Dati tecnici, prezzi e raccomandazioni restano invariati.

Four-brick architecture

EmbedderTurns text into vectorsBGE-M3, mxbai-embed-large or nomic-embed-text
Vector DBStores and searches nearest vectorsQdrant (recommended) or Chroma
LLMGenerates the answer from retrieved passagesLlama 3.3 70B Q4 or Qwen 2.5 32B Q5
OrchestratorChains everything, prompts, anti-hallucinationLlamaIndex (recommended) or Haystack

GPU sizing by team size (not just by concurrent users)

There are two numbers to distinguish: the team that shares the rig (the commercial number that matters for your quote), and the simultaneous active users (technical number, peak of requests processed in parallel at the same instant). They are very different.

A typical RAG user types a question, waits 5-15s for the answer, reads 30-60s, types again. So with 10 people permanently connected, only ~1-2 actually have a request executing at the same instant. vLLM continuous batching absorbs the queue — requests stack up without anyone noticing, perceived latency 2-5s even when 3 colleagues hit send in the same second.

24 GB (RTX 4090)Qwen 32B Q5 or Llama 70B Q35-10 people1-2 active
32 GB (RTX 5090)Llama 70B Q4 or Qwen 32B Q810-20 people2-3 active
48 GB (A6000)Llama 70B Q5 + KV cache headroom20-35 people3-5 active
64 GB (2× 5090)Llama 70B Q5 + vLLM batching40-80 people5-10 active
96 GB (2× A6000 NVLink)Llama 70B FP16 or Mistral Large80-150 people10-20 active

Default LocalIA stack

  • Ubuntu 24.04 LTS + NVIDIA drivers + CUDA 12.6
  • vLLM (OpenAI-compatible server) + Ollama (model management)
  • Qwen 2.5 32B Q5 by default (quality/throughput/VRAM sweet spot)
  • Qdrant + LlamaIndex + Open WebUI
  • Embeddings BGE-M3, re-ranker BGE-Reranker-v2-m3

Common pitfalls

  • Chunks too large (>2000 tokens dilute the re-ranking signal)
  • No re-ranker — adding BGE-Reranker doubles RAG precision
  • Wrong embeddings — BGE-M3 multilingual beats MPNet on FR/EN corpora
  • No metadata filters — date, department, client tagging is mandatory in enterprise
  • Tight VRAM — 90% utilization breaks the moment a user sends 8k tokens
At ~15M tokens/month in RAG, a Pro rig pays back in 11 months versus GPT-4o. Above 25M tokens/month, the Enterprise tier (2× A6000 NVLink) becomes the right call.

Apri il calcolatore / chiedici un consiglio con modello target, utenti e vincoli.

Domande frequenti

Which local RAG stack performs best in 2026?+
The winning combination: vLLM (serving) + Qwen 3 30B MoE (LLM) + nomic-embed v2 (embeddings) + Qdrant (vector DB) + LlamaIndex (orchestration). All open-weight, self-hosted, GDPR-compliant.
How much VRAM for a local team RAG in 2026?+
64 GB VRAM (2x RTX 5090) is enough for 5-10 concurrent users with Llama 70B Q5 or Qwen 3 30B MoE. It enables the vLLM batching needed for multi-user production and the KV cache for 32k context.
Do you need a GPU for RAG embeddings or is CPU enough?+
A GPU is mandatory in production. On an RTX 5090, nomic-embed v2 does ~5,000 embeddings/sec. On pure CPU it is ~200/sec, so initial indexing of a 100k-doc corpus takes 8 minutes (GPU) vs 8 hours (CPU).
Which rig for a 10-lawyer law firm RAG?+
A Pro build (~EUR 11,990) or Enterprise (~EUR 25,990) depending on sensitivity. A Pro 2x RTX 5090 + 64 GB VRAM handles 10 users with batching. An Enterprise 2x A6000 NVLink + 96 GB + ECC RAM adds maximal sovereignty, more reassuring for the bar.
At what volume does a local RAG pay back a rig?+
At 15M tokens/month, a Pro build pays back in ~11 months versus GPT-4o. At 25M tokens/month (active agentic RAG), it is 6-7 months. Above 50M tokens/month, Enterprise becomes relevant for a comfortable margin.
RAGOpen SourceStack