---
title: "Operational Pitfalls: Running Local LLMs Alongside Dev Clusters"
description: "Hard-won operational gotchas for local-LLM fleets: why two distinct models won't share one GPU, the one-model-at-a-time OOM guard, don't-cap-the-Docker-VM, post-bounce stuck-pod recovery, the macOS Local Network Privacy trap, and not testing while downloading."
url: https://agent-zone.ai/knowledge/sre/operational-pitfalls-local-llms-dev-clusters/
section: knowledge
date: 2026-05-25
categories: ["sre"]
tags: ["local-llm","incident-prevention","docker-desktop","minikube","ollama","gpu","oom","recovery","runbook"]
skills: ["incident-prevention","cluster-recovery","oom-prevention","gpu-capacity-ops"]
tools: ["ollama","lm-studio","docker-desktop","minikube","kubectl"]
levels: ["intermediate","advanced"]
word_count: 748
formats:
  json: https://agent-zone.ai/knowledge/sre/operational-pitfalls-local-llms-dev-clusters/index.json
  html: https://agent-zone.ai/knowledge/sre/operational-pitfalls-local-llms-dev-clusters/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Operational+Pitfalls%3A+Running+Local+LLMs+Alongside+Dev+Clusters
---


> **Decision-first:** One model per GPU (cloud-main + local-wake-filter for multi-model); unload-and-verify before every load; never lower the Docker Desktop VM cap; tunnel to loopback to dodge macOS Local Network Privacy; serialize loads and don't download during inference.

> **Scope & freshness:** Apple-Silicon Mac + minikube/Docker Desktop and a single-GPU LLM host (GB10), as of 2026-05-25. Incident patterns are durable; specific recovery commands assume kubectl/minikube/Docker Desktop.

A field runbook of failure modes seen running local LLMs next to development Kubernetes clusters. Each is a real incident pattern, not a hypothetical. (This whole doc is effectively a "what didn't work" catalog — that's the point.)

## Two distinct models won't share one GPU

Putting a small "wake-filter" model and a larger main model on the **same single GPU** fails on current consumer/workstation hardware, in one of two ways:

1. **Load/unload thrash.** With limited headroom, each wake-filter call evicts the main model from the GPU; the next main call reloads it from disk (multi-second), and quality drifts after a hot reload. Symptom: **20 output tokens, 0 tool calls** in production despite passing a *sequential* POC (which never loaded both at once).
2. **Memory exhaustion.** With both genuinely co-resident, you simply run out of unified memory (see the Apple-Silicon guide — one big model + a dev cluster already maxed a 64 GB Mac).

**Rule:** one model per GPU. The validated multi-model pattern is **cloud-main + local-wake-filter** (or split models across separate hosts), never two distinct models on one GPU. More VRAM alone doesn't fix the thrash — the architecture has to avoid the swap.
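The memory-exhaustion case can be ruled out with simple arithmetic before anything is loaded. A minimal sketch, assuming a hypothetical helper `fits_in_memory` and illustrative sizes in whole GB (only the 64 GB total comes from the incident above; the other numbers are made up for the example):

```bash
# fits_in_memory TOTAL_GB RESERVED_GB MODEL_GB...
# Succeeds only if the reserved memory (OS + dev cluster) plus every listed
# model fits inside the total. Hypothetical helper; sizes are whole GB.
fits_in_memory() {
  _total=$1; _reserved=$2; shift 2
  _need=$_reserved
  for _size in "$@"; do
    _need=$(( _need + _size ))
  done
  [ "$_need" -le "$_total" ]
}

# Illustrative use (values are assumptions, not measurements):
#   fits_in_memory 64 16 40      # one model: 56 <= 64, succeeds
#   fits_in_memory 64 16 40 12   # two models: 68 > 64, fails
```

Passing this check only rules out the second failure mode. It says nothing about load/unload thrash, which is why the rule stays one model per GPU even when the numbers fit.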

## One model at a time + an OOM guard

On a single-GPU box, load **one** large model at a time, and **verify nothing is resident before loading** — don't trust that the previous unload landed (especially over flaky SSH). A runtime's JIT auto-load can silently bring a second model back on request and OOM the host. Guard explicitly:

```bash
# unload_all, list_loaded, and load are placeholders: define them for your runtime.
# Unload, then retry and verify before loading the target; abort if anything is still resident.
unload_all
for i in 1 2 3; do [ -z "$(list_loaded)" ] && break; unload_all; sleep 3; done
[ -n "$(list_loaded)" ] && { echo "ABORT: model still resident, OOM risk" >&2; exit 1; }
load "$TARGET"
```
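The guard above calls three helpers it does not define. One possible set of definitions, assuming the runtime is Ollama and that its CLI provides `ollama ps`, `ollama stop`, and `ollama run` with an empty prompt as a preload; other runtimes (LM Studio, for example) need their own equivalents:

```bash
# Sketch of the three helpers for an Ollama host (assumption: Ollama is the runtime).

# Print the name of each resident model, one per line.
# `ollama ps` prints a header row first, so skip line 1 and any blank lines.
list_loaded() {
  ollama ps | awk 'NR > 1 && NF { print $1 }'
}

# Ask Ollama to unload every model that is currently resident.
unload_all() {
  list_loaded | while read -r _model; do
    ollama stop "$_model" </dev/null
  done
}

# Load the target by running it with an empty prompt, which loads the
# model into memory without generating a response.
load() {
  ollama run "$1" ""
}
```

These only define functions; nothing runs until the guard calls them. If the guard runs over SSH, define the helpers on the LLM host so the verification happens where the models actually live.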

## Don't lower the Docker Desktop VM memory cap

Capping the Docker Desktop VM to free RAM for a model **backfires**. On Apple Silicon the VM reclaims unused memory dynamically (the cap is a ceiling, not a reservation), so lowering it frees little, and the cluster **bursts on cold start**: a cap set too low (e.g. 12 GB) OOM-kills core services for no real gain. Leave it at the default. Two extra traps when bouncing Docker:

- The restart needs a **desktop "allow" dialog click** — a headless/scripted restart hangs at `no route to host` (the VM never finishes booting) until someone clicks it.
- After the bounce, the cluster comes back with **stuck pods** (`Init:Error`, `OOMKilled`, `CrashLoopBackOff`) — see recovery below.
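Because a scripted restart can sit waiting on the dialog indefinitely, it helps to bound the wait and fail loudly instead of hanging. A minimal sketch, assuming a hypothetical helper named `wait_until` (not part of kubectl or Docker Desktop); the timeout and interval values in the usage comment are illustrative:

```bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND...
# Re-runs COMMAND until it succeeds; gives up once TIMEOUT_SECONDS have passed.
# Hypothetical helper for scripted restarts.
wait_until() {
  _deadline=$(( $(date +%s) + $1 ))
  _interval=$2
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$_deadline" ]; then
      echo "TIMEOUT waiting for: $*" >&2
      return 1
    fi
    sleep "$_interval"
  done
}

# Illustrative use after bouncing Docker Desktop:
#   wait_until 300 5 kubectl --request-timeout=5s get nodes \
#     || echo "Node not back: check for the Docker Desktop allow dialog" >&2
```

The `--request-timeout` flag keeps each individual `kubectl` call from blocking, so the overall deadline is actually enforced.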

## Post-bounce cluster recovery

After any Docker/minikube restart, expect a pod stampede. Recover in order:

```bash
NS="${NS:?set NS to the target namespace}"
# 1. confirm the node is back
kubectl get nodes
# 2. confirm dependencies first (DB pods that others crash-loop against)
kubectl get pods -n "$NS" | grep -Ei "postgres|pgbouncer"
# 3. force-delete the wreckage so controllers recreate fresh (once deps are up)
kubectl get pods -n "$NS" --no-headers | grep -vE "Running|Completed" \
  | awk '{print $1}' | xargs -r kubectl delete pod -n "$NS" --grace-period=0 --force
```

Core services (gitea/mattermost/etc.) often crash-loop transiently because their DB wasn't ready during the stampede; once the DB is up, deleting the stuck pod lets it start clean.
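Since the force-delete only helps once the database is up, the "deps are up" condition can be checked mechanically rather than by eye. A minimal sketch, assuming a hypothetical helper `deps_ready` that reads `kubectl get pods --no-headers` output on stdin, and assuming dependency pods are identifiable by `postgres` or `pgbouncer` in their names (as in the grep above):

```bash
# deps_ready [NAME_REGEX]
# Reads `kubectl get pods --no-headers` output on stdin (columns: NAME READY STATUS ...).
# Succeeds only if at least one pod name matches NAME_REGEX and every matching
# pod is Running with all of its containers ready (READY shows n/n).
deps_ready() {
  awk -v pat="${1:-postgres|pgbouncer}" '
    tolower($1) ~ pat {
      seen = 1
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) bad = 1
    }
    END {
      if (seen && !bad) exit 0
      exit 1
    }
  '
}

# Illustrative gate before step 3 (NS is the namespace, an assumption here):
#   until kubectl get pods -n "$NS" --no-headers | deps_ready; do sleep 5; done
```

Checking the READY column as well as STATUS matters here: a database pod can report `Running` while its container is still failing readiness probes.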

## The macOS Local Network Privacy trap

A locally built binary (e.g., a Go client) calling an LLM host over the **LAN** can be blocked by macOS Local Network Privacy (LNP) with `no route to host`, while `curl` works (it's exempt). It looks like a cable/link fault but isn't. **Fix:** SSH-tunnel the LLM port to loopback (`-L 18888:localhost:8888`) and point the client at `127.0.0.1:18888`; loopback is LNP-exempt.
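The tunnel can be wrapped in a small function so the ports live in one place. A minimal sketch, assuming a hypothetical host `user@llm-host` and a hypothetical helper name `open_tunnel`; the ports match the `-L` example above:

```bash
# Hypothetical defaults: replace LLM_HOST with your own SSH target.
LLM_HOST="${LLM_HOST:-user@llm-host}"
LOCAL_PORT="${LOCAL_PORT:-18888}"    # loopback port the client connects to
REMOTE_PORT="${REMOTE_PORT:-8888}"   # port the LLM server listens on, on the host

# Open a background tunnel bound to loopback only.
#   -f  go to the background after authenticating
#   -N  forward ports only, run no remote command
#   ExitOnForwardFailure=yes  fail immediately if the local port cannot be bound
open_tunnel() {
  ssh -f -N -o ExitOnForwardFailure=yes \
    -L "127.0.0.1:${LOCAL_PORT}:localhost:${REMOTE_PORT}" "$LLM_HOST"
}

# Then point the client at http://127.0.0.1:18888 instead of the LAN address.
```

Binding explicitly to `127.0.0.1` keeps the forwarded port off the LAN, so the client's traffic stays on loopback.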

## Don't benchmark or download while a model is running

On a memory-bandwidth-bound box, every concurrent job contends for the same bandwidth. Running a second inference, or **downloading a model while inference is live**, tanks both. Serialize: download all models first, then benchmark one model at a time.
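Serialization is easier to keep when the scripts enforce it. A minimal sketch, assuming a hypothetical lock directory and a hypothetical wrapper `with_llm_lock`; it uses `mkdir` as the atomic lock because `flock` is not available on macOS by default:

```bash
# Hypothetical lock location shared by download and benchmark scripts.
LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/llm-host.lock}"

# with_llm_lock COMMAND...
# Runs COMMAND only if no other wrapped job is running; refuses otherwise.
# mkdir either creates the directory or fails, atomically, so two jobs
# cannot both acquire the lock.
with_llm_lock() {
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "BUSY: another download or inference job holds $LOCK_DIR" >&2
    return 75
  fi
  "$@"
  _rc=$?
  rmdir "$LOCK_DIR"
  return "$_rc"
}

# Illustrative use (model name is a placeholder):
#   with_llm_lock ollama pull "$MODEL"
#   with_llm_lock ./run-benchmark.sh "$MODEL"
```

If a wrapped job is killed before it returns, the lock directory is left behind and must be removed by hand; this sketch favors refusing to run over cleaning up automatically.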

## Quick checklist

- One model per GPU; cloud-main + local-wake-filter for multi-model.
- Unload-and-verify before every load (OOM guard).
- Never lower the Docker Desktop VM cap; expect the allow-dialog + stuck-pod recovery on a bounce.
- Tunnel to loopback to dodge macOS Local Network Privacy.
- Serial loads; never download while inference runs.

