Authoring Research Knowledge for Agents (Trust-but-Verify Format)

Decision-first: Write research docs so a downstream agent can act in 30 seconds and verify cheaply. Lead with the recommendation + its biggest caveat; attach a one-line verification recipe to every load-bearing claim; put what didn’t work where it can’t be missed. Descriptive “how X works” prose is the least valuable part.

Scope & freshness: Format conventions, version-independent. Authored 2026-05-25 from a local-LLM benchmarking effort; the examples are LLM/GPU but the format applies to any research/benchmarking knowledge.

GPU and Host Monitoring Across Mac and Linux/GB10 in One Prometheus

Decision-first: macOS and Linux node_exporter expose different metric names — write per-OS memory/disk expressions. The stock node dashboard hides Darwin on purpose. Scrape external hosts via ScrapeConfig + relabel job/instance. On a GB10, there are no GPU framebuffer or profiling metrics — read model footprint from system RAM.

Scope & freshness: kube-prometheus-stack + node_exporter + DCGM, macOS + Linux/GB10, as of 2026-05-25. Re-check the GB10 DCGM gaps after a DCGM/driver bump.

Operational Pitfalls: Running Local LLMs Alongside Dev Clusters

Decision-first: One model per GPU (cloud-main + local-wake-filter for multi-model); unload-and-verify before every load; never lower the Docker Desktop VM cap; tunnel to loopback to dodge macOS Local Network Privacy; serialize loads and don’t download during inference.

Scope & freshness: Apple-Silicon Mac + minikube/Docker Desktop and a single-GPU LLM host (GB10), as of 2026-05-25. Incident patterns are durable; specific recovery commands assume kubectl/minikube/Docker Desktop.

A field runbook of failure modes seen running local LLMs next to development Kubernetes clusters. Each is a real incident pattern, not a hypothetical. (This whole doc is effectively a “what didn’t work” catalog — that’s the point.)

Realistic GPU/Memory Sizing for Local LLMs

Decision-first: Budget file size + KV(context) + overhead, not file size — and on unified memory, subtract OS + co-resident workloads first. “Barely fits” means doesn’t fit. Size memory by total params, speed by active params.

Scope & freshness: General sizing principles (version-independent); worked numbers from 2026-05 on a GB10 (128 GB unified) + a 64 GB Apple-Silicon Mac. Re-measure resident sizes for your model/quant/context.

Resident size is bigger than the file#

The single most common sizing mistake is equating the model file size with how much memory it needs at runtime. Resident footprint is:

Running Local LLMs on the NVIDIA GB10 (DGX Spark / ASUS Ascent GX10)

Decision-first: On a GB10, pick low-active MoE models (A3B-class), serve GGUF (not MLX) via LM Studio, run one model at a time behind an OOM guard, and monitor GPU via DCGM but read the model footprint from system RAM (no framebuffer metrics). Dense 70B is unusable (~2-3 tok/s).

Scope & freshness: GB10 / Grace-Blackwell, 128 GB unified, DCGM 4.5.3 + driver 580-class, as of 2026-05-25. Re-check the DCGM profiling/framebuffer gaps after a driver/DCGM bump (≥585).

Serving LLMs on an Apple Silicon Mac That Also Runs a Dev Cluster

Decision-first: A Mac running a dev cluster is a lite-tier LLM host only (~8 GB models). It can’t hold even one large (~24 GB-resident) model alongside the cluster. Standardize on GGUF (Ollama can’t do MLX); don’t lower the Docker VM cap to “free RAM.”

Scope & freshness: 64 GB Apple-Silicon Mac running minikube/Docker Desktop, as of 2026-05-25. Numbers scale with your RAM and cluster size — re-measure, but the shape (cluster + one big model exhausts the box) holds.

Tuning Local LLMs for Agentic Coding: Sampling, Reasoning, and Budgets

Decision-first: Per new model, sweep temperature (don’t assume 0.3), try reasoning off for builders, test echo_reasoning both ways, and on budget_exceeded check turns-vs-tokens before changing either. The right config is model-specific — assume nothing.

Scope & freshness: Local + cloud models for agentic coding, 2026-05. Findings are per-model (see the specific models named); treat them as examples of shape, not universal constants — re-sweep for any new model.

Closed-Loop DONE for Autonomous Agent CI/CD: Why 'PR Opened' Is Not Shipped

A backlog item flips to status='completed' in the database. The dashboard ticks up. The agent posts “PR ready for review” and walks away. Three hours later, a different agent notices the fleet is running yesterday’s binary. The PR was never reviewed. CI was red on main. No image got built. Nothing actually shipped.

This is the closed-loop problem. When an autonomous agent declares work complete, what does “complete” mean? In most agent fleets, it means the agent called the last tool in its own workflow — typically open_pr or its equivalent. That is not the same as “the change is live for users”, and the gap between the two is where state-of-record systematically lies.

Verifying LLM-Written SDK Code Against the Pinned Version: A Recipe Against Type Hallucination

An agent writes a 200-line streaming-client implementation against your project’s pinned SDK. It compiles cleanly in the model’s head. The test code references SomeStreamEvent, the streaming function signature is func NewStreaming(ctx, params) (stream, error), and the iteration loop uses stream.Recv(). The reviewer skims it, sees plausible naming, approves. CI fails with “undefined: SomeStreamEvent”. The agent escalates: “the SDK is broken — package not found.” Hours later, somebody figures out that the SDK they’re pinned to has none of those symbols. The import path is different. The function returns one value not two. The iteration pattern is Next() / Current() / Err(), not Recv(). The model invented the API.

Building ARM64 Container Images When Upstream Doesn't Ship Them

A pod is CrashLoopBackOff with no application stack trace. The container manifest’s image field references an upstream tag that “should” work. The Docker pull succeeded. The container starts, exits with no log output, and restarts. The cause is almost always architecture: the image was published linux/amd64 only, the host is ARM64 (Apple Silicon, Graviton, Ampere), and the runtime is silently emulating — or failing to emulate — the binary. When upstream publishes ARM64 source artifacts but no ARM64 image, the fix is to build one.