Tiered-LLM Tooling: Local Model by Default, Escalate to the Frontier Model

Tiered-LLM Tooling: Local by Default, Escalate to Frontier#

When you build a chat or ops interface backed by an LLM, paying a frontier model for every interaction is wasteful — most interactions are cheap lookups, summaries, and routing. A tiered design serves the high-frequency majority with a small local model (e.g. an Ollama-served model on a GPU you already have) and escalates to a frontier model (e.g. Claude) only for the hard minority.

An End-to-End Workflow for Evaluating & Tuning Local LLMs for Agents

Decision-first: Follow this order and you’ll have a deployable model + tuned config in days, not weeks: (1) scope the hardware, (2) shortlist by active params, (3) per-model OFAT matrix, (4) run serially with an OOM guard (smoke first), (5) write a finding card per model, (6) decide. The expensive mistakes are skipping the smoke step, sweeping more than one factor at once, and trusting a single run.

Scope & freshness: Process is model/hardware-independent; the worked numbers are from a 2026-05 effort on a GB10 (128 GB) + an Apple-Silicon Mac, evaluating local MoE models vs cloud baselines for agentic coding. Re-validate the findings, not the workflow.

Benchmarking Local LLMs for Agentic Coding

Decision-first: Evaluate on the agent loop (read/edit/test/push), not one-shot patches. Use a multi-file execution-stamina task as your discriminator, tune OFAT at N≥3, and distinguish turn-ceiling vs token-ceiling vs capability-ceiling — only the last is unfixable by config.

Scope & freshness: Methodology is durable; the named results are 2026-05 snapshots — re-run the harness for current models.

Why public leaderboard scores mislead#

SWE-bench-style and chat leaderboards measure something adjacent to, but not the same as, autonomous tool-using coding. A model can score well on one-shot patch generation and still fail as an agent because the agent loop demands sustained, multi-turn behavior: read files, edit several, run tests, react to failures, and push — without giving up, looping, or declaring “done” early. Evaluate on the loop you’ll actually run.

GPU and Host Monitoring Across Mac and Linux/GB10 in One Prometheus

Decision-first: macOS and Linux node_exporter expose different metric names — write per-OS memory/disk expressions. The stock node dashboard hides Darwin on purpose. Scrape external hosts via ScrapeConfig + relabel job/instance. On a GB10, there are no GPU framebuffer or profiling metrics — read model footprint from system RAM.

Scope & freshness: kube-prometheus-stack + node_exporter + DCGM, macOS + Linux/GB10, as of 2026-05-25. Re-check the GB10 DCGM gaps after a DCGM/driver bump.

Operational Pitfalls: Running Local LLMs Alongside Dev Clusters

Decision-first: One model per GPU (cloud-main + local-wake-filter for multi-model); unload-and-verify before every load; never lower the Docker Desktop VM cap; tunnel to loopback to dodge macOS Local Network Privacy; serialize loads and don’t download during inference.

Scope & freshness: Apple-Silicon Mac + minikube/Docker Desktop and a single-GPU LLM host (GB10), as of 2026-05-25. Incident patterns are durable; specific recovery commands assume kubectl/minikube/Docker Desktop.

A field runbook of failure modes seen running local LLMs next to development Kubernetes clusters. Each is a real incident pattern, not a hypothetical. (This whole doc is effectively a “what didn’t work” catalog — that’s the point.)

Realistic GPU/Memory Sizing for Local LLMs

Decision-first: Budget file size + KV(context) + overhead, not file size — and on unified memory, subtract OS + co-resident workloads first. “Barely fits” means doesn’t fit. Size memory by total params, speed by active params.

Scope & freshness: General sizing principles (version-independent); worked numbers from 2026-05 on a GB10 (128 GB unified) + a 64 GB Apple-Silicon Mac. Re-measure resident sizes for your model/quant/context.

Resident size is bigger than the file#

The single most common sizing mistake is equating the model file size with how much memory it needs at runtime. Resident footprint is:

Running Local LLMs on the NVIDIA GB10 (DGX Spark / ASUS Ascent GX10)

Decision-first: On a GB10, pick low-active MoE models (A3B-class), serve GGUF (not MLX) via LM Studio, run one model at a time behind an OOM guard, and monitor GPU via DCGM but read the model footprint from system RAM (no framebuffer metrics). Dense 70B is unusable (~2-3 tok/s).

Scope & freshness: GB10 / Grace-Blackwell, 128 GB unified, DCGM 4.5.3 + driver 580-class, as of 2026-05-25. Re-check the DCGM profiling/framebuffer gaps after a driver/DCGM bump (≥585).

Serving LLMs on an Apple Silicon Mac That Also Runs a Dev Cluster

Decision-first: A Mac running a dev cluster is a lite-tier LLM host only (~8 GB models). It can’t hold even one large (~24 GB-resident) model alongside the cluster. Standardize on GGUF (Ollama can’t do MLX); don’t lower the Docker VM cap to “free RAM.”

Scope & freshness: 64 GB Apple-Silicon Mac running minikube/Docker Desktop, as of 2026-05-25. Numbers scale with your RAM and cluster size — re-measure, but the shape (cluster + one big model exhausts the box) holds.

Tuning Local LLMs for Agentic Coding: Sampling, Reasoning, and Budgets

Decision-first: Per new model, sweep temperature (don’t assume 0.3), try reasoning off for builders, test echo_reasoning both ways, and on budget_exceeded check turns-vs-tokens before changing either. The right config is model-specific — assume nothing.

Scope & freshness: Local + cloud models for agentic coding, 2026-05. Findings are per-model (see the specific models named); treat them as examples of shape, not universal constants — re-sweep for any new model.

Local LLMs for AI Agents: When It Makes Sense, When It Doesn't

A coding agent burns through tokens. The monthly bill from a frontier API provider for a single moderately active agent lands somewhere between fifty and a few hundred dollars, and the natural reaction is to check whether a one-time hardware purchase would be cheaper. The naive comparison — dollars per million tokens versus dollars amortized over five years — almost always concludes that local wins. The honest comparison rarely does, at least for coding workloads, at least as of mid-2026. The reason is a capability gap that doesn’t show up in any cost spreadsheet.