Can you run an LLM locally on a Mac Mini M4? (2026)
The Mac Mini M4 runs local LLMs surprisingly well thanks to unified memory (up to 64 GB of RAM that doubles as VRAM). Which models fit each configuration, real-world tok/s, and Mac vs a dedicated GPU.

Unified memory = your VRAM
On a Mac, the CPU and GPU share the same memory (unified architecture). In practice, most of the RAM can serve as VRAM, but macOS reserves part of it, so count on roughly 70% being available for the model. That is Apple's big advantage over a dedicated graphics card capped at 24 or 32 GB.
| Model | Unified memory | Memory bandwidth | Price (approx.) |
|---|---|---|---|
| Mac Mini M4 (base) | 16 / 24 / 32 GB | ~120 GB/s | ~EUR 700-1,100 |
| Mac Mini M4 Pro | 24 / 48 / 64 GB | ~273 GB/s | ~EUR 1,500-2,400 |
The catch: memory bandwidth
The M4 Pro has about 2.3x the memory bandwidth of the base M4 (273 vs 120 GB/s). LLM inference is memory-bound: generating each token means reading essentially all of the model's weights from memory, so bandwidth, more than RAM capacity, decides your tokens per second.
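A quick way to see why bandwidth dominates: if every token has to stream the whole model through memory, bandwidth divided by model size gives a hard ceiling on tokens per second. The sketch below is a back-of-the-envelope estimate, assuming an 8B Q4 model of about 4.9 GB (a typical quantized file size, not a figure from this article); the function name is purely illustrative.

```python
# Ceiling on single-user decode speed for a memory-bound LLM.
# Assumption: every generated token reads all model weights from memory once,
# so tokens/s cannot exceed bandwidth / model size. Real speeds land below.


def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens per second, ignoring compute and software overhead."""
    if model_size_gb <= 0:
        raise ValueError("model size must be positive")
    return bandwidth_gb_s / model_size_gb


# Assumed size of an 8B model quantized to Q4 (typical quantized file, ~4.9 GB).
MODEL_8B_Q4_GB = 4.9

for name, bandwidth in [("Mac Mini M4", 120.0), ("Mac Mini M4 Pro", 273.0)]:
    ceiling = max_tokens_per_second(bandwidth, MODEL_8B_Q4_GB)
    print(f"{name}: 8B Q4 ceiling ~{ceiling:.0f} tok/s")
```

This prints ceilings of about 24 tok/s for the base M4 and 56 tok/s for the M4 Pro. The measured ranges in the throughput table (~18-24 and ~40-50 tok/s) sit under these ceilings, which is what you would expect once compute and runtime overhead are added.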
Which LLMs fit by RAM
| Unified memory | Comfortable fit | At the limit |
|---|---|---|
| 16 GB | 7-8B in Q4/Q5 (Llama 8B, Mistral 7B) | 14B in Q4 |
| 24 GB | 14B in Q5, 32B in Q3 | 32B in Q4 (tight) |
| 48 GB | 32B in Q5/Q6 (Qwen 2.5 32B) | 70B in Q3 |
| 64 GB | 70B in Q4 (Llama 3.3 70B ~40 GB) | 70B in Q5 (barely) |
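The table can be approximated with simple arithmetic: weight size is roughly parameters times bits per weight divided by 8, and it has to fit in the ~70% of RAM that macOS leaves to the GPU. The sketch below is a rough check, not a calculator; the bits-per-weight values are approximate averages for common quantization variants, and the flat 10% overhead for KV cache and buffers is an assumption, as are the helper names.

```python
# Rough "does it fit?" check for a quantized LLM in Mac unified memory.
# All constants are assumptions for illustration, not measurements.

USABLE_FRACTION = 0.70  # share of RAM available to the GPU (the ~70% rule of thumb)
OVERHEAD = 1.10         # assumed +10% for KV cache, context and runtime buffers

# Approximate effective bits per weight; exact values vary by quant variant.
BITS_PER_WEIGHT = {"Q3": 3.5, "Q4": 4.5, "Q5": 5.5, "Q6": 6.6}


def model_size_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(ram_gb: float, params_billion: float, quant: str) -> bool:
    """True if weights plus overhead stay within the usable share of RAM."""
    needed = model_size_gb(params_billion, quant) * OVERHEAD
    return needed <= ram_gb * USABLE_FRACTION


for ram, params, quant in [(16, 8, "Q4"), (24, 14, "Q5"), (48, 32, "Q5"), (64, 70, "Q4")]:
    verdict = "fits" if fits(ram, params, quant) else "too big"
    print(f"{ram} GB, {params}B {quant}: ~{model_size_gb(params, quant):.1f} GB weights, {verdict}")
```

With these assumptions a 70B model in Q4 comes out at about 39 GB of weights, in line with the ~40 GB quoted in the table, and every comfortable-fit example checked here passes. The check is deliberately conservative, so some at-the-limit cells (for example 32B in Q4 on 24 GB) fail it; whether they run in practice depends on context length and on what else is using memory.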
Real throughput (tokens/second)
| Machine (bandwidth) | 8B Q4 | 32B Q5 |
|---|---|---|
| Mac Mini M4 (120 GB/s) | ~18-24 tok/s | too slow / won't fit |
| Mac Mini M4 Pro (273 GB/s) | ~40-50 tok/s | ~10-14 tok/s |

These speeds are fine for solo chat or a background coding assistant, but not for serving several users at once: the Mac does not batch requests efficiently (Metal is less mature than CUDA here).
Mac Mini M4 or a dedicated GPU?
- Max VRAM: 64 GB unified (Mac) vs 24-32 GB (RTX 4090/5090).
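- Throughput: a dedicated GPU is clearly faster (GDDR6X on the 4090, GDDR7 on the 5090, roughly 1 to 1.8 TB/s of bandwidth).
- Multi-user: a dedicated GPU wins (vLLM batching); the Mac is weak here.
- Power/noise: Mac ~30-50 W and near-silent; RTX 4090/5090 ~350-575 W and loud under load.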
- Entry price: ~EUR 700 (Mac base) vs ~EUR 1,100 (used 4090).
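To put the power figures in perspective, the sketch below converts the wattage ranges from the list into yearly energy. It assumes, purely for illustration, 8 hours of full load per day, every day; the RTX range is for the card alone, and the host PC around it draws extra power on top. The helper name is hypothetical.

```python
# Yearly energy for the power figures quoted in the comparison list.
# Assumption (illustrative only): 8 hours of full load per day, every day.

HOURS_PER_DAY = 8


def kwh_per_year(watts: float, hours_per_day: float = HOURS_PER_DAY) -> float:
    """Energy in kWh for a constant draw over a year of daily use."""
    return watts * hours_per_day * 365 / 1000


for label, low_w, high_w in [("Mac Mini M4", 30, 50), ("RTX 4090/5090 (card only)", 350, 575)]:
    print(f"{label}: {kwh_per_year(low_w):.0f}-{kwh_per_year(high_w):.0f} kWh/year")
```

Under that assumption the Mac uses roughly 88-146 kWh per year versus about 1,022-1,679 kWh for the card alone, around 11 to 12 times more when comparing the low ends and high ends of the two ranges.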
The Mac Mini M4 is excellent for solo, silent, low-power use with models up to 32B (M4 Pro, 48 GB), and even 70B in Q4 (64 GB). For high throughput or several concurrent users, a dedicated GPU still wins. To check your exact setup, use the LocalIA GPU to LLM calculator: it tells you what fits and at which quantization. It is free, needs no signup, and is an independent resource that sells nothing.