
Operational Pitfalls: Running Local LLMs Alongside Dev Clusters

May 25, 2026

Incident-Prevention, Cluster-Recovery, Oom-Prevention, Gpu-Capacity-Ops

Local-Llm, Incident-Prevention, Docker-Desktop, Minikube, Ollama, Gpu, Oom, Recovery, Runbook

Ollama, Lm-Studio, Docker-Desktop, Minikube, Kubectl

Decision-first: Run one model per GPU (for multi-model setups, use cloud-main + local-wake-filter); unload and verify before every load; never lower the Docker Desktop VM cap; tunnel to loopback to avoid macOS Local Network Privacy restrictions; serialize loads; and don’t download models during inference.

Scope & freshness: Apple-Silicon Mac + minikube/Docker Desktop and a single-GPU LLM host (GB10), as of 2026-05-25. Incident patterns are durable; specific recovery commands assume kubectl/minikube/Docker Desktop.

A field runbook of failure modes seen when running local LLMs next to development Kubernetes clusters. Each is a real incident pattern, not a hypothetical. (This whole doc is effectively a “what didn’t work” catalog; that’s the point.)
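The "unload-and-verify before every load" and "serialize loads" rules from the summary can be sketched as below. This is a minimal illustration, not the exact procedure used in these incidents. It assumes the LLM host runs Ollama and is reachable on loopback at the default port 11434 (for example through a tunnel), and it uses Ollama's `GET /api/ps` to list loaded models and `POST /api/generate` with `keep_alive: 0` to unload one. The names `unload_and_verify` and `load_exclusive`, the timeout values, and the `10m` keep-alive are hypothetical choices made for this example.

```python
import json
import threading
import time
import urllib.request

# Assumption: Ollama is reachable on loopback at its default port.
OLLAMA_URL = "http://127.0.0.1:11434"

# Serializes loads within this process; cross-process locking is out of scope here.
_load_lock = threading.Lock()


def _http_json(method, path, payload=None):
    """Send one JSON request to the Ollama HTTP API and decode the reply."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        OLLAMA_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()
    return json.loads(body) if body else {}


def loaded_models(http=_http_json):
    """Return the names of the models the server reports as loaded."""
    return [m["name"] for m in http("GET", "/api/ps").get("models", [])]


def unload_and_verify(http=_http_json, timeout_s=30.0, poll_s=0.5, sleep=time.sleep):
    """Ask every loaded model to unload, then poll until none remain."""
    for name in loaded_models(http):
        # keep_alive=0 with no prompt asks Ollama to unload the model.
        http("POST", "/api/generate",
             {"model": name, "keep_alive": 0, "stream": False})
    deadline = time.monotonic() + timeout_s
    while True:
        if not loaded_models(http):
            return
        if time.monotonic() >= deadline:
            raise TimeoutError("models still loaded after unload request")
        sleep(poll_s)


def load_exclusive(model, http=_http_json, **verify_kwargs):
    """Load one model only after confirming the GPU host holds no other model."""
    with _load_lock:
        unload_and_verify(http, **verify_kwargs)
        # A generate request without a prompt loads the model into memory.
        http("POST", "/api/generate",
             {"model": model, "keep_alive": "10m", "stream": False})
        return loaded_models(http)
```

Calling `load_exclusive("some-model:tag")` raises `TimeoutError` instead of loading when the previous model does not leave memory in time, which turns a potential out-of-memory condition into an explicit, recoverable failure. The `http` parameter exists only so the logic can be exercised against a fake server.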