Journal · 10 articles

Bench notes from the lab.

Real benchmarks, tested configurations and field notes. No listicles, no SEO bait.

Building a local RAG in 2026: the Ollama + Qdrant + LlamaIndex stack

Four-layer architecture, tech choices (vLLM, Qdrant, LlamaIndex, Open WebUI), GPU sizing by number of concurrent users, and 24-month TCO vs GPT-4o.

9 min · 2026-05-12

Quantization

Q4 vs Q5 vs Q8: which quantization for Llama 70B in 2026?

VRAM table per quant (Q3 to FP16), measured quality loss (Δ perplexity), GPU recommendations and estimated tok/s per setup. No bullshit.

8 min · 2026-05-12

Llama

Llama 4 locally in 2026: VRAM, GPUs and realistic alternatives

Llama 4 Scout, Maverick, Behemoth: what really fits at home in 2026. VRAM per version, minimum GPUs, and 5 competitive alternatives (70B to 123B).

8 min · 2026-05-12

Mistral

Mistral Large 123B locally: which rig, what real cost in 2026

Mistral Large 123B open-weight at home: VRAM by quant, minimum rig (2x A6000 NVLink), ROI vs Mistral API by monthly volume, and when to prefer Llama 3.3 70B instead.

9 min · 2026-05-12

vLLM

vLLM vs Ollama in production: the 2026 benchmark (single user, batching, multi-user)

Real benchmark of the two inference runtimes on 1x and 2x RTX 5090 (consumer Blackwell cards have no NVLink, so the pair talks over PCIe). Single user, 4 concurrent users, 10 users under load: who wins when, and why continuous batching changes everything.

8 min · 2026-05-12

RAG

Sovereign RAG with Qwen 3 30B MoE: the complete 2026 stack

Why Qwen 3 30B-A3B (MoE, about 3B active parameters per token) is the 2026 sweet spot for a sovereign team RAG. Stack: vLLM + Qdrant + nomic-embed + LlamaIndex, on a Pro rig (EUR 11,990). All open-weight, fully self-hosted.

9 min · 2026-05-12

Pricing

How much does an AI server for an SME cost in 2026?

A clear breakdown of the real cost of a local AI rig in 2026: hardware, software, electricity and support, with three priced tiers and a cloud API comparison.

8 min · 2026-05-08

Strategy

Cloud vs on-prem AI: break-even can arrive in 9 months

An honest comparison between OpenAI / Anthropic APIs and a local AI rig, with three concrete TCO scenarios over 24 months.

9 min · 2026-05-08

GPU

RTX 5090 vs Mac Studio M3 Ultra for local LLMs

Two philosophies and two winners depending on the use case: dedicated VRAM vs unified memory, throughput, multi-user serving and EUR per GB.

8 min · 2026-05-08

GPU

Which GPU do you need to run Llama 3.3 70B locally in 2026?

VRAM by quantization, compatible GPUs, RTX 5090 vs A6000 vs H100, and the cost/performance trade-off against OpenAI APIs.

9 min · 2026-05-07