Serving LLMs on an Apple Silicon Mac That Also Runs a Dev Cluster

Decision-first: A Mac running a dev cluster is a lite-tier LLM host only (~8 GB models). It can’t hold even one large (~24 GB-resident) model alongside the cluster. Standardize on GGUF (Ollama can’t do MLX); don’t lower the Docker VM cap to “free RAM.”

Scope & freshness: 64 GB Apple-Silicon Mac running minikube/Docker Desktop, as of 2026-05-25. Numbers scale with your RAM and cluster size — re-measure, but the shape (cluster + one big model exhausts the box) holds.

Building ARM64 Container Images When Upstream Doesn't Ship Them

A pod is CrashLoopBackOff with no application stack trace. The container manifest’s image field references an upstream tag that “should” work. The Docker pull succeeded. The container starts, exits with no log output, and restarts. The cause is almost always architecture: the image was published linux/amd64 only, the host is ARM64 (Apple Silicon, Graviton, Ampere), and the runtime is silently emulating — or failing to emulate — the binary. When upstream publishes ARM64 source artifacts but no ARM64 image, the fix is to build one.

Running Kubernetes on Apple Silicon: Setup, Gotchas, Recovery

A minikube cluster on Apple Silicon looks like a pure Kubernetes problem until the first Docker Desktop crash. The failure modes that bite hardest on M-series Macs live one layer below the cluster: in Docker Desktop’s memory allocator, in QEMU’s address-space layout, and in the destructive default of minikube delete. None of these are mentioned in the standard minikube setup guide, and all three will eat real workload state when they fire. This is the operational layer on top of minikube setup and drivers and ARM64 K8s images — the host-side discipline that keeps the cluster alive.

Ollama Setup and Model Management: Installation, Model Selection, Memory Management, and ARM64 Native

Ollama Setup and Model Management#

Ollama turns running local LLMs into a single command. It handles model downloads, quantization, GPU memory allocation, and exposes a REST API that any application can call. No Python environments, no CUDA driver debugging, no manual GGUF file management.

Installation#

# macOS
brew install ollama

# Linux (official installer)
curl -fsSL https://ollama.com/install.sh | sh

# Or run as a Docker container
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Start the Ollama server:

Minikube Setup, Drivers, and Resource Configuration

Minikube Setup, Drivers, and Resource Configuration#

Minikube runs a single-node Kubernetes cluster on your local machine. The difference between a minikube setup that feels like a toy and one that behaves like production comes down to three choices: the driver, the resource allocation, and the Kubernetes version. Get these wrong and you spend more time fighting the tool than using it.

Installation#

On macOS with Homebrew:

brew install minikube

On Linux via direct download:

Minikube with Docker Driver on Apple Silicon

Why the Docker Driver on ARM64#

When running Minikube on Apple Silicon (M1/M2/M3/M4), the driver you choose determines whether your containers run natively or through emulation. The Docker driver runs containers directly on the host architecture — ARM64 — with zero emulation overhead.

This matters because QEMU user-mode emulation, which kicks in when you try to run amd64 images on ARM64, cannot reliably execute Go binaries. The specific failure is a crash in lfstack.push, deep in Go’s runtime memory management. This is not a fixable application bug — it is a fundamental incompatibility between QEMU’s user-mode emulation and Go’s lock-free stack implementation.