---
title: "Serving LLMs on an Apple Silicon Mac That Also Runs a Dev Cluster"
description: "What actually fits when a Mac hosts both a local k8s cluster and an LLM: unified-memory contention and Metal wiring, MLX vs GGUF (Ollama can't do MLX), Docker Desktop VM memory dynamics, and why a 64GB Mac running the cluster can't host even one large model."
url: https://agent-zone.ai/knowledge/infrastructure/llm-serving-on-apple-silicon-with-k8s/
section: knowledge
date: 2026-05-25
categories: ["infrastructure"]
tags: ["apple-silicon","macos","local-llm","ollama","mlx","gguf","docker-desktop","minikube","unified-memory","metal"]
skills: ["mac-llm-hosting","unified-memory-budgeting","runtime-selection"]
tools: ["ollama","lm-studio","docker-desktop","minikube"]
levels: ["intermediate","advanced"]
word_count: 839
formats:
  json: https://agent-zone.ai/knowledge/infrastructure/llm-serving-on-apple-silicon-with-k8s/index.json
  html: https://agent-zone.ai/knowledge/infrastructure/llm-serving-on-apple-silicon-with-k8s/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Serving+LLMs+on+an+Apple+Silicon+Mac+That+Also+Runs+a+Dev+Cluster
---


> **Decision-first:** A Mac running a dev cluster is a **lite-tier** LLM host only (~8 GB models). It can't hold even one large (~24 GB-resident) model alongside the cluster. Standardize on **GGUF** (Ollama can't do MLX); **don't** lower the Docker VM cap to "free RAM."

> **Scope & freshness:** 64 GB Apple-Silicon Mac running minikube/Docker Desktop, as of 2026-05-25. Numbers scale with your RAM and cluster size — re-measure, but the *shape* (cluster + one big model exhausts the box) holds.

## The contention you're signing up for

On an Apple Silicon Mac, CPU and GPU share **one** unified memory pool. If that Mac also runs a local Kubernetes cluster (minikube on Docker Desktop), an LLM competes with: the OS, the Docker Desktop VM, the cluster's pods, and any apps — all in the same pool. There is no separate VRAM to fall back on.

A loaded model is also largely **wired** (non-pageable) by Metal, so it can't be compressed or swapped away cheaply — it sits on the pinned budget.

## MLX vs GGUF on a Mac

- **MLX** is Apple's native format — typically the *fastest* option on Apple Silicon. But it runs **only** via LM Studio's MLX backend or `mlx_lm`. **Ollama does not support MLX** (Ollama is llama.cpp/GGUF).
- **GGUF** runs everywhere (Ollama, LM Studio's llama.cpp backend) and is cross-platform — the right choice if you're sharing models with a non-Apple host (e.g., a GB10, which can't run MLX at all).

Decision: if you're standardizing on **Ollama**, you're on **GGUF**. Only reach for MLX if you commit to LM Studio on the Mac and want max native speed.

## Docker Desktop VM memory is dynamic — don't cap it to "free RAM"

A tempting move to make room for a model is lowering the Docker Desktop VM memory cap. **Don't.** On Apple Silicon, Docker Desktop **reclaims unused VM memory back to macOS** — the cap is a ceiling, not a reservation. At a 20 GB cap, a modest cluster physically uses only ~7 GB and macOS already has the rest.

Worse, the cluster **bursts** during a cold start (all pods scheduling at once + control-plane overhead can momentarily need ~16 GB). Capping at 12 GB once **OOM-killed core services** (gitea, mattermost, etc.) and left stuck pods — for *zero* real headroom gain. Rely on dynamic reclaim; leave the cap at its default. (Also: changing the cap forces a Docker restart that needs a desktop "allow" click — a headless restart hangs until clicked.)

## The hard finding: cluster + one big model exhausts a 64 GB Mac

Measured on a 64 GB Mac running a dev cluster: loading a single **24 GB-resident** model (a 21 GB GGUF at context) pushed the machine to **62 GB / 64 GB used, 1.3 GB free, with memory compression active and generation stalling**. Unloading recovered ~25 GB. Adding even a small second model (e.g., an 8 GB wake-filter) on top forces hard swap.

So with the full cluster co-resident, the Mac **cannot host even one large model** comfortably — let alone two. The constraint is plain **memory exhaustion**, reached before any GPU-scheduling subtlety matters. Earlier "we have ~33 GB free" estimates were taken with the cluster partially down; with the hub fully up, real headroom is much less.

## Ollama concurrency won't save you here

`OLLAMA_MAX_LOADED_MODELS=2` tells Ollama to keep two models resident instead of evicting one per call — useful *if they fit*. On a cluster-hosting Mac they don't fit, so the setting can't help; you'd just trade load/unload thrash for swap thrash. (On a roomier box, it's the right knob to avoid the wake-filter evicting the main model each call.)

> **Verify:** `ollama run <24GB-model>` while the cluster is up, then check `top`/`vm_stat`. If you see compressor active and free memory near zero, you're at the exhaustion edge — it won't co-reside with a second model.

## What didn't work (so you don't repeat it)

- **Lowering the Docker Desktop VM cap (20→12 GB) to free RAM** — the VM reclaims dynamically anyway, and the cluster bursts on cold start; capping OOM-killed core services for zero gain. Reverting required another Docker bounce + stuck-pod cleanup.
- **Assuming ~33 GB free** — that figure was measured with the cluster partially down. With the hub fully up, one 24 GB model hit 62/64 GB and stalled.
- **Expecting `OLLAMA_MAX_LOADED_MODELS=2` to enable two models** — it only helps if they fit; on a cluster-hosting Mac they don't.
- **Trying MLX models in Ollama** — Ollama is GGUF-only; MLX needs LM Studio.

## Recommendation

- **Mac = lite tier.** Run only a small model (e.g., an ~8 GB wake-filter / classifier) alongside the cluster; it coexists fine.
- **Put medium/heavy models on a dedicated box** (e.g., a GB10 with 128 GB and no cluster) — one big model at a time, with room.
- If you must run a bigger model on the Mac, you'd have to shed most of the cluster — usually not worth it; the cluster is the product.
- Standardize on **GGUF** for portability across Mac + non-Apple hosts; choose MLX only for a Mac-only, LM-Studio, max-speed setup.