Building a local RAG in 2026: the Ollama + Qdrant + LlamaIndex stack
Four-brick architecture, tech choices (vLLM, Qdrant, LlamaIndex, Open WebUI), GPU sizing by team size and concurrent users, and 24-month TCO versus GPT-4o.

In 2026, RAG (Retrieval-Augmented Generation) has become the number one use case for local LLMs. Your documents, your LLM, your server: no third-party API, no data leaving your network, and answers that cite their sources. Here is the stack that works.
Four-brick architecture
| Brick | Role | Recommended options |
| --- | --- | --- |
| Embedder | Turns text into vectors | BGE-M3, mxbai-embed-large, or nomic-embed-text |
| Vector DB | Stores vectors and runs nearest-neighbor search | Qdrant (recommended) or Chroma |
| LLM | Generates the answer from retrieved passages | Llama 3.3 70B Q4 or Qwen 2.5 32B Q5 |
| Orchestrator | Chains everything: prompting, anti-hallucination guards | LlamaIndex (recommended) or Haystack |
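These four bricks wire together in a few lines of LlamaIndex. A minimal sketch, assuming Ollama serves both the embedder and the LLM and Qdrant runs locally on its default port; the model names, the `docs` collection and the `./docs` folder are illustrative, not fixed parts of the stack.

```python
import qdrant_client
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Brick 1: embedder (served by Ollama)
Settings.embed_model = OllamaEmbedding(model_name="bge-m3")
# Brick 3: LLM (served by Ollama here; swap in vLLM for production throughput)
Settings.llm = Ollama(model="qwen2.5:32b", request_timeout=120.0)

# Brick 2: vector DB (local Qdrant on its default port)
client = qdrant_client.QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=client, collection_name="docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Brick 4: orchestrator. Index the corpus, then ask a sourced question.
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

response = index.as_query_engine(similarity_top_k=5).query("What does the 2026 leave policy say?")
print(response)
print(response.source_nodes[0].metadata)  # provenance of the top retrieved chunk
```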
GPU sizing by team size (not just by concurrent users)
There are two numbers to distinguish: the size of the team sharing the rig (the commercial figure that matters for your quote) and the number of simultaneously active users (the technical figure: the peak number of requests actually executing in parallel at the same instant). They are very different.
A typical RAG user types a question, waits 5-15 s for the answer, reads for 30-60 s, then types again. So with 10 people connected all day, only about 1-2 have a request executing at any given instant. vLLM's continuous batching absorbs the queue: requests stack up without anyone noticing, and perceived latency stays around 2-5 s even when three colleagues hit send in the same second (a small sketch of such a burst follows the sizing table below).
| VRAM (GPU) | Model | Team size | Concurrent active users |
| --- | --- | --- | --- |
| 24 GB (RTX 4090) | Qwen 32B Q5 or Llama 70B Q3 | 5-10 people | 1-2 active |
| 32 GB (RTX 5090) | Llama 70B Q4 or Qwen 32B Q8 | 10-20 people | 2-3 active |
| 48 GB (A6000) | Llama 70B Q5 + KV cache headroom | 20-35 people | 3-5 active |
| 64 GB (2× 5090) | Llama 70B Q5 + vLLM batching | 40-80 people | 5-10 active |
| 96 GB (2× A6000 NVLink) | Llama 70B FP16 or Mistral Large | 80-150 people | 10-20 active |
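You can reproduce that burst yourself. A hedged sketch, assuming vLLM exposes its OpenAI-compatible API on `localhost:8000` and was launched with the model id used below; both are assumptions to adapt to your own server.

```python
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

# Local vLLM endpoint; the URL and model id are assumptions, not fixed values.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")
MODEL = "Qwen/Qwen2.5-32B-Instruct"

def ask(question: str) -> float:
    """Send one chat completion and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        max_tokens=256,
    )
    return time.perf_counter() - start

# Three colleagues hit send in the same second.
questions = ["Summarise the 2026 travel policy."] * 3
with ThreadPoolExecutor(max_workers=3) as pool:
    latencies = list(pool.map(ask, questions))

# With continuous batching the requests share the GPU instead of queuing
# strictly one behind the other, so per-request latency degrades only mildly.
print([f"{t:.1f}s" for t in latencies])
```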
Default LocalIA stack
- Ubuntu 24.04 LTS + NVIDIA drivers + CUDA 12.6
- vLLM (OpenAI-compatible server) + Ollama (model management)
- Qwen 2.5 32B Q5 by default (quality/throughput/VRAM sweet spot)
- Qdrant + LlamaIndex + Open WebUI
- Embeddings BGE-M3, re-ranker BGE-Reranker-v2-m3 (wired together in the sketch below)
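Below is a sketch of how the serving layer meets LlamaIndex: BGE-M3 for embeddings, BGE-Reranker-v2-m3 as a post-retrieval re-ranker, and the LLM reached through vLLM's OpenAI-compatible API. The endpoint, model id and collection name are assumptions; reuse whatever your deployment actually exposes.

```python
import qdrant_client
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai_like import OpenAILike
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Embeddings: BGE-M3 pulled from Hugging Face and run locally
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")

# LLM: whatever vLLM was launched with, reached through its OpenAI-compatible API
Settings.llm = OpenAILike(
    model="Qwen/Qwen2.5-32B-Instruct",
    api_base="http://localhost:8000/v1",
    api_key="not-needed-locally",
    is_chat_model=True,
)

# Reopen the existing Qdrant collection built at ingestion time
client = qdrant_client.QdrantClient(host="localhost", port=6333)
index = VectorStoreIndex.from_vector_store(
    QdrantVectorStore(client=client, collection_name="docs")
)

# Retrieve wide (top 10), then let the re-ranker keep the 4 most relevant chunks
reranker = SentenceTransformerRerank(model="BAAI/bge-reranker-v2-m3", top_n=4)
query_engine = index.as_query_engine(similarity_top_k=10, node_postprocessors=[reranker])
print(query_engine.query("Who signs off on purchases above 10k?"))
```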
Common pitfalls
- Chunks too large (>2000 tokens dilute the re-ranking signal; see the chunking sketch after this list)
- No re-ranker — adding BGE-Reranker doubles RAG precision
- Wrong embeddings — BGE-M3 multilingual beats MPNet on FR/EN corpora
- No metadata filters — date, department, client tagging is mandatory in enterprise
- Tight VRAM — 90% utilization breaks the moment a user sends 8k tokens
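Two of these pitfalls come down to a few lines of configuration at ingestion time. A sketch, assuming the embedder and LLM are already set via `Settings` as in the earlier sketches; the `department` tag, the `./docs/legal` folder and the collection name are illustrative assumptions.

```python
import qdrant_client
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Pitfall 1: oversized chunks. Split at ~512 tokens with a small overlap.
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)

# Pitfall 4: missing metadata. Tag every document at ingestion time.
documents = SimpleDirectoryReader("./docs/legal").load_data()
for doc in documents:
    doc.metadata["department"] = "legal"

client = qdrant_client.QdrantClient(host="localhost", port=6333)
storage_context = StorageContext.from_defaults(
    vector_store=QdrantVectorStore(client=client, collection_name="docs")
)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# At query time, restrict retrieval to the tagged subset.
filters = MetadataFilters(filters=[MetadataFilter(key="department", value="legal")])
print(index.as_query_engine(similarity_top_k=10, filters=filters).query("Which clients have an active NDA?"))
```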
At ~15M tokens/month in RAG, a Pro rig pays back in 11 months versus GPT-4o. Above 25M tokens/month, the Enterprise tier (2× A6000 NVLink) becomes the right call.
Open the calculator / request a quote with your target model, users and constraints.