Journal · 14 articles

Bench notes from the lab.

Real benchmarks, tested configurations and field notes. No listicles, no SEO bait.

Bench note · featured

Which LLMs run on a budget GPU (RTX 3060, 3070) in 2026?

You don't need an RTX 5090 to start with local AI. The best GPUs by budget (RTX 3060 12 GB, 4060 Ti 16 GB, 3090 24 GB), what fits by VRAM, and the king of VRAM per euro on the used market.

· 7 min
Apple

Can you run a local LLM on a Mac Mini M4? (2026)

The Mac Mini M4 runs local LLMs surprisingly well thanks to unified memory (up to 64 GB acting as VRAM). Which models fit per configuration, real tokens/second, and Mac vs a dedicated GPU.

· 7 min
GPU

RTX 5090 vs RTX 4090 for local AI: which to choose in 2026?

A level-headed matchup from the angle that matters: which models run, at what throughput, and at what price. VRAM (32 vs 24 GB), GDDR7 bandwidth, a models table, and the verdict by profile.

· 7 min
VRAM

How much VRAM to run a local LLM? (formula + 2026 table)

The method to estimate an LLM's VRAM: parameter count x bytes per parameter for the weights, plus KV cache and overhead. A ready-to-use table (7B to 123B x Q4/Q5/Q8) and the minimum card per model.

· 8 min
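
As a rough illustration of the formula named in the teaser above, here is a minimal sketch. The function name, the 1 GB default overhead and the use of 1 GB = 1e9 bytes are assumptions made for the example, not figures from the article; real quantized files also carry slightly more than their nominal bits per weight.

```python
def estimate_vram_gb(params_billion, bits_per_weight, kv_cache_gb=0.0, overhead_gb=1.0):
    """Rough VRAM estimate in GB (1 GB = 1e9 bytes).

    params_billion  : model size in billions of parameters (e.g. 7 for a 7B model)
    bits_per_weight : nominal bits per parameter (4 for Q4, 8 for Q8, 16 for FP16)
    kv_cache_gb     : KV cache budget, supplied by the caller (depends on context length)
    overhead_gb     : runtime overhead; the 1 GB default is a placeholder assumption
    """
    # Weights: parameter count x bytes per parameter.
    # Billions of parameters x bytes per parameter gives GB directly.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + kv_cache_gb + overhead_gb


if __name__ == "__main__":
    # 7B model at a nominal 4 bits, 2 GB of KV cache, default overhead
    print(estimate_vram_gb(7, 4, kv_cache_gb=2.0))  # 6.5
```

The KV cache term is left as an input because it depends on context length and model architecture, which the teaser does not specify.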
RAG

Building a local RAG in 2026: the Ollama + Qdrant + LlamaIndex stack

Four-component architecture, tech choices (vLLM, Qdrant, LlamaIndex, Open WebUI), GPU sizing by concurrent users, and 24-month TCO vs GPT-4o.

· 9 min
Quantization

Q4 vs Q5 vs Q8: which quantization for Llama 70B in 2026?

VRAM table per quant (Q3 to FP16), measured quality loss (Δ perplexity), GPU recommendations and estimated tok/s per setup. No bullshit.

· 8 min
Llama

Llama 4 locally in 2026: VRAM, GPUs and realistic alternatives

Llama 4 Scout, Maverick, Behemoth: what really fits at home in 2026. VRAM per version, minimum GPUs, and five alternatives (70B to 123B) that hold their own.

· 8 min
Mistral

Mistral Large 123B locally: which rig, what real cost in 2026

Mistral Large 123B open-weight at home: VRAM by quant, minimum rig (2x A6000 NVLink), ROI vs Mistral API by monthly volume, and when to prefer Llama 3.3 70B instead.

· 9 min
vLLM

vLLM vs Ollama in production: the 2026 benchmark (single user, batching, multi-user)

Real benchmark of the two inference runtimes on RTX 5090 and 2x RTX 5090 NVLink. Single user, 4 concurrent users, 10 users under load: who wins when, and why continuous batching changes everything.

· 8 min
RAG

Sovereign RAG with Qwen 3 30B MoE: the complete 2026 stack

Why Qwen 3 30B-A3B (MoE, 3B active params/token) is the 2026 sweet spot for a sovereign team RAG. Stack: vLLM + Qdrant + nomic-embed + LlamaIndex, on a Pro rig (EUR 11,990). All open-weight, fully self-hosted.

· 9 min
Pricing

How much does an AI server for an SME cost in 2026?

A clear breakdown of the real cost of a local AI rig in 2026: hardware, software, electricity and support, with three priced tiers and a cloud API comparison.

· 8 min
Strategy

Cloud vs on-prem AI: break-even can arrive in 9 months

An honest comparison between OpenAI / Anthropic APIs and a local AI rig, with three concrete TCO scenarios over 24 months.

· 9 min
GPU

RTX 5090 vs Mac Studio M3 Ultra for local LLMs

Two philosophies and two winners depending on the use case: dedicated VRAM vs unified memory, throughput, multi-user serving and EUR per GB.

· 8 min
GPU

Which GPU do you need to run Llama 3.3 70B locally in 2026?

VRAM by quantization, compatible GPUs, RTX 5090 vs A6000 vs H100, and the cost/performance trade-off against OpenAI APIs.

· 9 min