An End-to-End Workflow for Evaluating & Tuning Local LLMs for Agents

Decision-first: Follow this order and you’ll have a deployable model and a tuned config in days, not weeks: (1) scope the hardware, (2) shortlist by active params, (3) build a per-model one-factor-at-a-time (OFAT) matrix, (4) run serially behind an OOM guard (smoke test first), (5) write a finding card per model, (6) decide. The expensive mistakes are skipping the smoke step, sweeping more than one factor at once, and trusting a single run.

Scope & freshness: Process is model/hardware-independent; the worked numbers are from a 2026-05 effort on a GB10 (128 GB) + an Apple-Silicon Mac, evaluating local MoE models vs cloud baselines for agentic coding. Re-validate the findings, not the workflow.
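As a rough illustration of step (3), an OFAT matrix can be generated from a baseline config plus a list of alternative values per factor. This is a minimal sketch, not the original effort's harness: the function name `ofat_matrix` and the factor names (`temperature`, `reasoning`, `max_turns`) and their values are assumptions chosen for the example.

```python
from typing import Any


def ofat_matrix(
    baseline: dict[str, Any], factors: dict[str, list[Any]]
) -> list[dict[str, Any]]:
    """Return the baseline config plus one variant per alternative value.

    Each variant differs from the baseline in exactly one factor, which is
    what makes the sweep one-factor-at-a-time.
    """
    configs = [dict(baseline)]
    for name, values in factors.items():
        for value in values:
            if value == baseline.get(name):
                continue  # same as the baseline run, no need to repeat it
            variant = dict(baseline)
            variant[name] = value
            configs.append(variant)
    return configs


# Hypothetical baseline and factor levels, for illustration only.
baseline = {"temperature": 0.3, "reasoning": True, "max_turns": 40}
factors = {"temperature": [0.1, 0.3, 0.7], "reasoning": [True, False]}
matrix = ofat_matrix(baseline, factors)

for config in matrix:
    print(config)
```

The matrix grows with the sum of the factor levels rather than their product, which keeps a serial run on a single machine tractable.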

Benchmarking Local LLMs for Agentic Coding

Decision-first: Evaluate on the agent loop (read/edit/test/push), not one-shot patches. Use a multi-file execution-stamina task as your discriminator, tune one factor at a time (OFAT) with N≥3 runs per configuration, and distinguish a turn ceiling from a token ceiling from a capability ceiling; only the last is unfixable by config.

Scope & freshness: Methodology is durable; the named results are 2026-05 snapshots — re-run the harness for current models.
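The three ceilings named above can be told apart mechanically if each run records how much of its turn and token budget it used. The following is a minimal sketch under that assumption; the `RunRecord` fields and the `classify_ceiling` function are hypothetical and do not come from any particular harness.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical per-run summary; field names are assumptions."""
    passed: bool
    turns_used: int
    max_turns: int
    tokens_used: int
    max_tokens: int


def classify_ceiling(run: RunRecord) -> str:
    """Label a run by which limit, if any, it ran into."""
    if run.passed:
        return "ok"
    if run.turns_used >= run.max_turns:
        return "turn-ceiling"  # fixable: raise the turn budget
    if run.tokens_used >= run.max_tokens:
        return "token-ceiling"  # fixable: raise the token budget
    # Failed with budget to spare: a candidate capability ceiling.
    return "capability-ceiling"


out_of_turns = RunRecord(False, 40, 40, 90_000, 200_000)
out_of_tokens = RunRecord(False, 22, 40, 200_000, 200_000)
gave_up_early = RunRecord(False, 9, 40, 30_000, 200_000)

for run in (out_of_turns, out_of_tokens, gave_up_early):
    print(classify_ceiling(run))
```

A single failed run with budget to spare is only a hint; repeating the run before calling it a capability ceiling is consistent with the N≥3 advice.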

Why public leaderboard scores mislead

SWE-bench-style and chat leaderboards measure something adjacent to, but not the same as, autonomous tool-using coding. A model can score well on one-shot patch generation and still fail as an agent because the agent loop demands sustained, multi-turn behavior: read files, edit several, run tests, react to failures, and push — without giving up, looping, or declaring “done” early. Evaluate on the loop you’ll actually run.

GPU and Host Monitoring Across Mac and Linux/GB10 in One Prometheus

Decision-first: macOS and Linux node_exporter expose different metric names — write per-OS memory/disk expressions. The stock node dashboard hides Darwin on purpose. Scrape external hosts via ScrapeConfig + relabel job/instance. On a GB10, there are no GPU framebuffer or profiling metrics — read model footprint from system RAM.

Scope & freshness: kube-prometheus-stack + node_exporter + DCGM, macOS + Linux/GB10, as of 2026-05-25. Re-check the GB10 DCGM gaps after a DCGM/driver bump.

Operational Pitfalls: Running Local LLMs Alongside Dev Clusters

Decision-first: One model per GPU (cloud-main + local-wake-filter for multi-model); unload-and-verify before every load; never lower the Docker Desktop VM cap; tunnel to loopback to dodge macOS Local Network Privacy; serialize loads and don’t download during inference.

Scope & freshness: Apple-Silicon Mac + minikube/Docker Desktop and a single-GPU LLM host (GB10), as of 2026-05-25. Incident patterns are durable; specific recovery commands assume kubectl/minikube/Docker Desktop.

A field runbook of failure modes seen running local LLMs next to development Kubernetes clusters. Each is a real incident pattern, not a hypothetical. (This whole doc is effectively a “what didn’t work” catalog — that’s the point.)

Running Local LLMs on the NVIDIA GB10 (DGX Spark / ASUS Ascent GX10)

Decision-first: On a GB10, pick low-active MoE models (A3B-class), serve GGUF (not MLX) via LM Studio, run one model at a time behind an OOM guard, and monitor GPU via DCGM but read the model footprint from system RAM (no framebuffer metrics). Dense 70B is unusable (~2-3 tok/s).

Scope & freshness: GB10 / Grace-Blackwell, 128 GB unified, DCGM 4.5.3 + driver 580-class, as of 2026-05-25. Re-check the DCGM profiling/framebuffer gaps after a driver/DCGM bump (≥585).
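Because the model footprint has to be read from system RAM on this hardware, a simple OOM guard can check available memory before each load. The sketch below parses the `MemAvailable` field of Linux `/proc/meminfo` (reported in kB); the function names and the 8 GiB headroom default are assumptions for illustration, not values from the original effort.

```python
def mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo content and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found in meminfo text")


def can_load(meminfo_text: str, model_gib: float, headroom_gib: float = 8.0) -> bool:
    """Allow a load only if the model plus headroom fits in available RAM.

    headroom_gib is an assumed safety margin; tune it for your host.
    """
    return mem_available_gib(meminfo_text) >= model_gib + headroom_gib


# Synthetic sample so the sketch runs anywhere; on a real Linux host,
# read the text with open("/proc/meminfo").read() instead.
sample = "MemTotal:       127000000 kB\nMemAvailable:   104857600 kB\n"

print(mem_available_gib(sample))
print(can_load(sample, model_gib=60.0))
print(can_load(sample, model_gib=95.0))
```

Sampling the same value before and after a load gives an approximate resident footprint for the model, assuming nothing else on the host changes its memory use in between.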

Serving LLMs on an Apple Silicon Mac That Also Runs a Dev Cluster

Decision-first: A Mac running a dev cluster is a lite-tier LLM host only (~8 GB models). It can’t hold even one large (~24 GB-resident) model alongside the cluster. Standardize on GGUF (Ollama can’t do MLX); don’t lower the Docker VM cap to “free RAM.”

Scope & freshness: 64 GB Apple-Silicon Mac running minikube/Docker Desktop, as of 2026-05-25. Numbers scale with your RAM and cluster size — re-measure, but the shape (cluster + one big model exhausts the box) holds.

Tuning Local LLMs for Agentic Coding: Sampling, Reasoning, and Budgets

Decision-first: Per new model, sweep temperature (don’t assume 0.3), try reasoning off for builders, test echo_reasoning both ways, and on budget_exceeded check turns-vs-tokens before changing either. The right config is model-specific — assume nothing.

Scope & freshness: Local + cloud models for agentic coding, 2026-05. Findings are per-model (see the specific models named); treat them as examples of shape, not universal constants — re-sweep for any new model.
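One way to keep a temperature sweep honest is to aggregate pass rates per setting and refuse to report any setting with fewer than three runs. This is a minimal sketch; the `pass_rates` function, the `MIN_RUNS` constant, and the sample outcomes are made up for illustration and are not measured results.

```python
from collections import defaultdict

MIN_RUNS = 3  # assumed minimum repeats per setting


def pass_rates(results: list[tuple[float, bool]]) -> dict[float, float]:
    """Map each temperature to its pass rate across repeated runs."""
    grouped: dict[float, list[bool]] = defaultdict(list)
    for temperature, passed in results:
        grouped[temperature].append(bool(passed))

    rates = {}
    for temperature, outcomes in grouped.items():
        if len(outcomes) < MIN_RUNS:
            raise ValueError(
                f"temperature {temperature} has only {len(outcomes)} run(s)"
            )
        rates[temperature] = sum(outcomes) / len(outcomes)
    return rates


# Illustrative outcomes only, not data from any real model.
results = [
    (0.3, True), (0.3, False), (0.3, True),
    (0.7, True), (0.7, True), (0.7, True),
]
rates = pass_rates(results)
best = max(rates, key=rates.get)
print(rates, best)
```

With so few runs per setting, small differences in pass rate are weak evidence; the point of the guard is only to rule out conclusions drawn from a single run.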

Autonomy Tiers and Escalation as Runtime Contracts, Not Prompt Instructions

An agent is dispatched on a task it cannot complete. The spec is broken. The dependency is missing. The credentials are wrong. What happens next determines whether you have an autonomous fleet or a fleet that quietly fails.

The most common answer, instructing the agent in its prompt to “ask for help if stuck,” does not survive contact with production. Agents keep grinding and produce broken work, output text that looks like a question but never reaches a human, or politely “complete” the task by writing nothing and reporting success. None of these failure modes is visible from the outside until the dashboards have been lying for hours.

Heterogeneous A/B/C/D Pool Dispatch: Real Model Comparison Without an Eval Harness

You need to know whether model-X is worth deploying for your real workload. The benchmarks suggest yes, but benchmarks are static and your workload is not. The standard answer — build an eval harness — runs into two structural problems: harnesses are expensive to build well, and they tend to over-fit to the inputs you remembered to include in the corpus, missing the real production failure modes you discover only later.

Advanced Ansible Patterns: Roles, Collections, Dynamic Inventory, Vault, and Testing

Advanced Ansible Patterns

As infrastructure grows from a handful of servers to hundreds or thousands, Ansible patterns that worked at small scale become bottlenecks. Playbooks that were simple and readable at 10 hosts become tangled at 100. Roles that were self-contained become duplicated across teams. This framework helps you decide which advanced patterns to adopt and when.

Roles vs Collections

Roles and collections both organize Ansible content, but they serve different purposes and operate at different scales.