Llama.cpp

An End-to-End Workflow for Evaluating & Tuning Local LLMs for Agents

May 25, 2026

Llm-Evaluation-Workflow, Benchmark-Orchestration, Model-Selection

Local-Llm, Workflow, Benchmarking, Evaluation, Model-Selection, Tuning, Process, Moe

Decision-first: Follow this order and you’ll have a deployable model plus a tuned config in days, not weeks: (1) scope the hardware, (2) shortlist models by active parameters, (3) build a per-model OFAT (one-factor-at-a-time) matrix, (4) run serially behind an OOM guard, smoke test first, (5) write a finding card per model, (6) decide. The expensive mistakes are skipping the smoke step, sweeping more than one factor at once, and trusting a single run.

Scope & freshness: The process is model- and hardware-independent; the worked numbers are from a 2026-05 effort on a GB10 (128 GB) plus an Apple-Silicon Mac, evaluating local MoE models against cloud baselines for agentic coding. Re-validate the findings, not the workflow.
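Step (3) can be sketched in a few lines. This is a minimal illustration, not the harness used in the effort: the function name `ofat_matrix`, the baseline config, and the sweep values are all assumptions made for the example. It shows why OFAT keeps the run count small: each run differs from the baseline in exactly one factor.

```python
from typing import Any


def ofat_matrix(
    baseline: dict[str, Any], sweeps: dict[str, list[Any]]
) -> list[dict[str, Any]]:
    """Return the baseline plus one variant per swept value.

    Each variant changes exactly one factor relative to the baseline,
    so any difference in results can be attributed to that factor.
    """
    runs = [dict(baseline)]
    for factor, values in sweeps.items():
        for value in values:
            if value == baseline[factor]:
                continue  # the baseline run already covers this value
            variant = dict(baseline)
            variant[factor] = value
            runs.append(variant)
    return runs


# Hypothetical baseline and sweep values, for illustration only.
baseline = {"temperature": 0.3, "reasoning": True, "echo_reasoning": False}
sweeps = {
    "temperature": [0.0, 0.3, 0.7],
    "reasoning": [True, False],
    "echo_reasoning": [False, True],
}

matrix = ofat_matrix(baseline, sweeps)
for run in matrix:
    print(run)
```

With these values the matrix has 5 runs (baseline plus 4 variants) where a full factorial sweep would need 12. Repeating each run several times, rather than trusting a single result, multiplies that count but keeps it tractable.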

Running Local LLMs on the NVIDIA GB10 (DGX Spark / ASUS Ascent GX10)

May 25, 2026

Infrastructure

Intermediate, Advanced

Local-Llm-Deployment, Gpu-Memory-Sizing, Model-Runtime-Selection, Moe-Model-Selection

Gb10, Dgx-Spark, Asus-Ascent-Gx10, Local-Llm, Lm-Studio, Llama-Cpp, Gguf, Unified-Memory, Moe, Grace-Blackwell, Dcgm

Lm-Studio, Lms, Llama.cpp, Dcgm-Exporter, Ssh

Decision-first: On a GB10, pick low-active MoE models (A3B-class), serve GGUF (not MLX) via LM Studio, and run one model at a time behind an OOM guard. Monitor the GPU via DCGM, but read the model footprint from system RAM, because DCGM reports no framebuffer metrics on this platform. Dense 70B models are unusable (~2-3 tok/s).

Scope & freshness: GB10 / Grace-Blackwell, 128 GB unified memory, DCGM 4.5.3 + driver 580-class, as of 2026-05-25. Re-check the DCGM profiling/framebuffer gaps after a driver/DCGM bump (≥585).
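Because the footprint has to be read from system RAM, an OOM guard can be as simple as checking `MemAvailable` in `/proc/meminfo` before loading a model. The sketch below assumes a Linux host; the function names, the 8 GiB headroom, and the sample figures are hypothetical, and it is not the guard used in the original effort.

```python
def mem_available_gib(meminfo_text: str) -> float:
    """Parse the MemAvailable line (reported in kB) from /proc/meminfo text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found in meminfo text")


def safe_to_load(
    meminfo_text: str, model_gib: float, headroom_gib: float = 8.0
) -> bool:
    """True if the model plus an assumed safety headroom fits in available RAM."""
    return mem_available_gib(meminfo_text) >= model_gib + headroom_gib


# Sample text standing in for the real file; the numbers are made up.
sample = "MemTotal:       134217728 kB\nMemAvailable:   104857600 kB\n"

print(mem_available_gib(sample))
print(safe_to_load(sample, model_gib=60.0))
print(safe_to_load(sample, model_gib=95.0))

# On the device itself, read the live values instead:
#   with open("/proc/meminfo") as f:
#       ok = safe_to_load(f.read(), model_gib=60.0)
```

Taking the text as an argument rather than opening the file inside the function keeps the guard testable off-device. The headroom should be sized to whatever else shares the unified memory.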

Tuning Local LLMs for Agentic Coding: Sampling, Reasoning, and Budgets

May 25, 2026

Agent-Tooling

Intermediate, Advanced

Llm-Tuning, Sampling-Configuration, Prompt-Directive-Design, Budget-Configuration

Local-Llm, Tuning, Temperature, Reasoning, Sampling, Prompt-Engineering, Moe, Ollama, Lm-Studio, Tool-Calling

Lm-Studio, Ollama, Llama.cpp

Decision-first: For each new model, sweep temperature (don’t assume 0.3), try reasoning off for builders, test echo_reasoning both ways, and on budget_exceeded check whether the turn limit or the token limit was hit before changing either. The right config is model-specific: assume nothing.

Scope & freshness: Local + cloud models for agentic coding, 2026-05. Findings are per-model (see the specific models named); treat them as indicative of the shape of the result, not as universal constants, and re-sweep for any new model.
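The turns-vs-tokens check on `budget_exceeded` can be made mechanical. The sketch below is illustrative only: the function name and the four counters (`turns_used`, `max_turns`, `tokens_used`, `max_tokens`) are hypothetical and would need to be mapped onto whatever the agent harness actually records.

```python
def diagnose_budget(
    turns_used: int, max_turns: int, tokens_used: int, max_tokens: int
) -> str:
    """Report which limit a budget_exceeded run actually hit."""
    hit_turns = turns_used >= max_turns
    hit_tokens = tokens_used >= max_tokens
    if hit_turns and hit_tokens:
        return "both"
    if hit_turns:
        return "turns"
    if hit_tokens:
        return "tokens"
    return "neither"


# Made-up counters for three runs that ended in budget_exceeded.
print(diagnose_budget(turns_used=40, max_turns=40, tokens_used=52_000, max_tokens=200_000))
print(diagnose_budget(turns_used=12, max_turns=40, tokens_used=200_000, max_tokens=200_000))
print(diagnose_budget(turns_used=10, max_turns=40, tokens_used=30_000, max_tokens=200_000))
```

A "turns" result with plenty of tokens left suggests raising the turn limit or investigating looping behavior, while a "tokens" result after only a few turns points at verbose output or reasoning. A "neither" result means the error came from somewhere else and neither limit should be changed.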