Running Local LLMs on the NVIDIA GB10 (DGX Spark / ASUS Ascent GX10)

Decision-first: On a GB10, pick low-active MoE models (A3B-class), serve GGUF (not MLX) via LM Studio, run one model at a time behind an OOM guard, and monitor GPU via DCGM but read the model footprint from system RAM (no framebuffer metrics). Dense 70B is unusable (~2-3 tok/s).

Scope & freshness: GB10 / Grace-Blackwell, 128 GB unified, DCGM 4.5.3 + driver 580-class, as of 2026-05-25. Re-check the DCGM profiling/framebuffer gaps after a driver/DCGM bump (≥585).

Local LLMs for AI Agents: When It Makes Sense, When It Doesn't

A coding agent burns through tokens. The monthly bill from a frontier API provider for a single moderately active agent lands somewhere between fifty and a few hundred dollars, and the natural reaction is to check whether a one-time hardware purchase would be cheaper. The naive comparison — dollars per million tokens versus dollars amortized over five years — almost always concludes that local wins. The honest comparison rarely does, at least for coding workloads, at least as of mid-2026. The reason is a capability gap that doesn’t show up in any cost spreadsheet.