---
title: "An End-to-End Workflow for Evaluating & Tuning Local LLMs for Agents"
description: "The chain of events that turns 'which local model?' into a deployable answer: scope the hardware, shortlist by active-params, build a per-model OFAT matrix, run serially with an OOM guard (smoke first), record per-model findings, and decide. Includes the decision points and dead-ends to skip."
url: https://agent-zone.ai/knowledge/agent-tooling/local-llm-evaluation-workflow/
section: knowledge
date: 2026-05-25
categories: ["agent-tooling"]
tags: ["local-llm","workflow","benchmarking","evaluation","model-selection","tuning","process","moe"]
skills: ["llm-evaluation-workflow","benchmark-orchestration","model-selection"]
tools: ["lm-studio","ollama","llama.cpp"]
levels: ["advanced"]
word_count: 781
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/local-llm-evaluation-workflow/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/local-llm-evaluation-workflow/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=An+End-to-End+Workflow+for+Evaluating+%26+Tuning+Local+LLMs+for+Agents
---


> **Decision-first:** Follow this order and you'll have a deployable model + tuned config in days, not weeks: (1) scope the hardware, (2) shortlist by *active* params, (3) per-model OFAT matrix, (4) run **serially** with an OOM guard (**smoke first**), (5) write a finding card per model, (6) decide. The expensive mistakes are skipping the smoke step, sweeping more than one factor at once, and trusting a single run.

> **Scope & freshness:** Process is model/hardware-independent; the worked numbers are from a 2026-05 effort on a GB10 (128 GB) + an Apple-Silicon Mac, evaluating local MoE models vs cloud baselines for agentic coding. Re-validate the *findings*, not the *workflow*.

This is the sequence behind the per-topic articles (sizing, tuning, benchmarking, monitoring). Those tell you *what* we learned; this tells you the *order* to work in and where the cliffs are.

## Step 1 — Scope the hardware first (it eliminates candidates before you download anything)

Before shortlisting models, pin down the hardware's binding constraint. On unified-memory boxes (GB10, Apple Silicon), decode is limited by **memory bandwidth**: every generated token has to read the active weights from memory, so the **active parameter count**, not the total, drives speed. This step alone disqualifies whole model classes.

> **Verify:** decode a dense ~70B model once. If it's ~2-3 tok/s, you're bandwidth-bound — restrict the shortlist to low-active MoE.
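
The bandwidth argument can be sanity-checked with arithmetic before downloading anything. The sketch below is a rough upper bound only: it assumes each decoded token streams the active weights from memory once and ignores KV-cache reads, compute, and runtime overhead. The function name and the example figures (250 GB/s, 1 byte per parameter) are illustrative placeholders, not measurements from this effort; substitute your own box's numbers.

```python
def decode_ceiling_tok_s(active_params_b: float,
                         bytes_per_param: float,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on decode tokens/s when memory bandwidth is the limit.

    active_params_b: active parameters per token, in billions.
    bytes_per_param: depends on quantization (e.g. ~1.0 for 8-bit).
    bandwidth_gb_s:  memory bandwidth in GB/s (placeholder; use your own).
    """
    active_gb_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / active_gb_per_token


if __name__ == "__main__":
    # Hypothetical inputs: a dense 70B model vs a 3B-active MoE at 8-bit.
    for label, active_b in [("dense 70B", 70.0), ("3B-active MoE", 3.0)]:
        ceiling = decode_ceiling_tok_s(active_b, 1.0, 250.0)
        print(f"{label}: ceiling ~{ceiling:.1f} tok/s")
```

Measured throughput will land below this ceiling; what carries over is the ratio between candidates, which is why a low-active MoE stays usable where a dense model of similar total size does not.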

## Step 2 — Shortlist by active params + a memory-fit check

Pick candidates whose **active** params suit the bandwidth budget and whose **resident** size (file + KV cache at your context length, not file alone) fits with margin. Include both a **local** set and a couple of **cloud** models as fixed reference baselines; you need a known-good ceiling to interpret local scores against.
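
A minimal sketch of the fit check, assuming a standard attention layout where the KV cache stores one key and one value vector per layer, per KV head, per token. Models with sliding-window, hybrid, or linear-attention layers will need a different formula, and the 20% margin is an arbitrary illustrative default, not a figure from this effort.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB for a standard (grouped-query) attention model.

    The leading factor 2 covers keys plus values; bytes_per_elem=2 is fp16.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_elem)
    return total_bytes / 2**30


def fits(file_gib: float, kv_gib: float, total_gib: float,
         margin: float = 0.2) -> bool:
    """True if file + KV stays inside total memory minus a safety margin."""
    return file_gib + kv_gib <= total_gib * (1.0 - margin)


if __name__ == "__main__":
    # Hypothetical architecture numbers; read the real ones from model metadata.
    kv = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128,
                      context_tokens=32768)
    print(f"KV cache: {kv:.1f} GiB, fits: {fits(20.0, kv, 128.0)}")
```

The point of separating the two terms is that the KV cache grows linearly with context, so a model that fits at a short context can still fail at the context an agent actually needs.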

## Step 3 — Build a per-model OFAT tuning matrix

For each shortlisted model, generate a matrix that varies **one factor at a time** from a baseline: temperature `{0.0, 0.3, 0.7, default}`, reasoning on/off, `echo_reasoning` on/off, turn/token budgets. Don't combine factors yet — you won't know which one moved the needle.

> **Verify:** every cell in the matrix must be run at least three times (N≥3). A single run lies (temp-0 isn't even deterministic on llama.cpp).
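
One way to generate such a matrix is sketched below. The factor names mirror the ones listed above, but the baseline values, the `max_turns` levels, and the dictionary layout are assumptions for illustration; the harness behind this article may structure its configs differently.

```python
# Hypothetical baseline; pick whatever your harness treats as the default.
BASELINE = {
    "temperature": "default",
    "reasoning": "on",
    "echo_reasoning": "off",
    "max_turns": 80,
}

FACTORS = {
    "temperature": [0.0, 0.3, 0.7, "default"],
    "reasoning": ["on", "off"],
    "echo_reasoning": ["on", "off"],
    "max_turns": [80, 100],
}


def ofat_cells(baseline: dict, factors: dict) -> list:
    """Return the baseline cell plus one cell per non-baseline factor level.

    Every non-baseline cell differs from the baseline in exactly one key.
    """
    cells = [dict(baseline)]
    for name, levels in factors.items():
        for level in levels:
            if level != baseline[name]:
                cell = dict(baseline)
                cell[name] = level
                cells.append(cell)
    return cells


def run_plan(cells: list, repeats: int = 3) -> list:
    """Expand cells into (cell_index, repeat_index, config) tuples."""
    return [(i, r, cell) for i, cell in enumerate(cells) for r in range(repeats)]


if __name__ == "__main__":
    cells = ofat_cells(BASELINE, FACTORS)
    print(len(cells), "cells,", len(run_plan(cells)), "runs at N=3")
```

Because the cell count grows with the sum of factor levels rather than their product, the matrix stays small enough to repeat every cell three times.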

## Step 4 — Run serially, smoke first, with an OOM guard

On a single-GPU box, keep **one model resident at a time**. The workflow:

1. **Smoke** the new model (N=1, one trivial canary) to confirm the runtime can load the architecture and emit tool calls *before* committing to a multi-hour matrix.
2. Load with an **OOM guard**: unload all → verify zero resident → load target. Never trust that the previous unload landed.
3. Run the matrix against the tier's canaries; **don't download or run anything else** while it runs (bandwidth contention degrades both).
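
The guard in step 2 can be sketched independently of any particular runtime. The three callables below (`unload_all`, `list_resident`, `load`) are hypothetical hooks that you would wire to whatever your runtime exposes (LM Studio, Ollama, or llama.cpp); none of them is a real API of those tools, and the timeout values are arbitrary.

```python
import time
from typing import Callable, Sequence


def guarded_load(target: str,
                 unload_all: Callable[[], None],
                 list_resident: Callable[[], Sequence[str]],
                 load: Callable[[str], None],
                 timeout_s: float = 60.0,
                 poll_s: float = 1.0) -> None:
    """Unload everything, verify nothing is resident, then load the target."""
    unload_all()

    # Never trust that the unload landed: poll until the runtime reports zero.
    deadline = time.monotonic() + timeout_s
    while list_resident():
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"refusing to load {target}: still resident: {list(list_resident())}"
            )
        time.sleep(poll_s)

    load(target)

    # Confirm the target is the only model resident before starting the matrix.
    resident = list(list_resident())
    if resident != [target]:
        raise RuntimeError(f"expected only {target} resident, found {resident}")
```

Raising instead of loading anyway is deliberate: a failed guard costs a retry, while two models resident at once can cost the whole run.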

## Step 5 — Record a finding card per model

For each model, capture a compact card — this is the durable output:

```
Model: qwen3.6-35b-a3b (35B-A3B, local $0)
Best config: temp 0.0, reasoning none, echo OFF, turns 80 + completeness directive
Result: medium 9/9 (winner) · heavy 8/9 peak / ~7/9 typical
Ceiling: hardest multi-file spec ~33% — capability/variance, NOT budget
Didn't work: echo on (8/9→6/9); turns 100 + 4M tokens (still ~33%)
Speed: ~5× faster than the 12B-active heavy model
```

Keep cards for **cloud** models too (the fixed baselines) so the leaderboard is comparable.

## Step 6 — Decide (leaderboard + deploy recipe)

Rank by the tier's pass-rate, then write the **deploy recipe** (the exact config) for the winner. Note the runner-up and *why* (e.g. "faster but caps at ~33% on the hardest spec").
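
A small sketch of the ranking step, assuming each model's best config has a list of per-run pass counts on the same tier. Breaking ties by the worst run is an assumption added here (it favors the steadier model), and the model names and scores in the example are made up for illustration.

```python
def leaderboard(runs: dict, total_tasks: int) -> list:
    """Rank models by pooled pass-rate, breaking ties by the worst single run.

    runs maps model name -> list of pass counts, one entry per repeat.
    Returns (model, pass_rate, worst_run) tuples, best first.
    """
    rows = []
    for model, passes in runs.items():
        pooled_rate = sum(passes) / (total_tasks * len(passes))
        rows.append((model, pooled_rate, min(passes)))
    return sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)


if __name__ == "__main__":
    # Hypothetical pass counts out of 9 tasks, three repeats each.
    example = {
        "model-a": [8, 7, 7],
        "model-b": [9, 9, 8],
        "model-c": [8, 8, 6],
    }
    for model, rate, worst in leaderboard(example, total_tasks=9):
        print(f"{model}: {rate:.0%} pooled, worst run {worst}/9")
```

Pooling the counts before dividing keeps equal totals exactly equal, so the tie-break is not decided by floating-point noise.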

## Decision points + dead-ends (where time actually goes)

> **Claim:** A small-active MoE can clear *most* of a heavy tier but hits a capability/variance ceiling on the hardest multi-file spec that **no budget lever fixes**.
> **Confidence:** high — confirmed by sweeping *both* turn and token budgets.
> **Verify:** if a task fails `budget_exceeded` at turns < cap, raise *tokens* not turns; if it still caps at ~33% with generous tokens *and* turns, it's capability — stop tuning.
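
The verify rule above can be written as a small triage function. The returned labels are hypothetical names, and two branches are inferences rather than statements from this article: raising turns when the turn cap itself was hit, and inspecting the transcript for any other failure reason.

```python
def budget_triage(failure_reason: str, turns_used: int, turn_cap: int,
                  budgets_generous: bool) -> str:
    """Suggest the next lever for a failing task.

    budgets_generous: True once both token and turn budgets have already
    been raised well past what passing tasks need.
    """
    if budgets_generous:
        # Still failing with generous tokens *and* turns: capability ceiling.
        return "stop_tuning"
    if failure_reason == "budget_exceeded":
        if turns_used < turn_cap:
            # Ran out of budget before running out of turns: tokens bind.
            return "raise_tokens"
        return "raise_turns"  # inferred: the turn cap itself was reached
    return "inspect_transcript"  # inferred: not a budget failure


if __name__ == "__main__":
    print(budget_triage("budget_exceeded", turns_used=42, turn_cap=80,
                        budgets_generous=False))
```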

## What didn't work (so you don't repeat it)

- **Skipping the smoke step.** A model whose tool-call format the runtime can't parse emits **0 tool calls**. A smoke run catches that in about a minute; without it, you burn a multi-hour matrix that scores 0/N.
- **Combining factors before OFAT.** You can't attribute the change. One factor at a time, then an "all-on" cell.
- **Trusting one run.** An 8/9 became a 6/9 on the next pass of the same config — variance is real; N≥3.
- **Assuming a cross-model default** (temp 0.3, echo settings). Per-model sweep, always.
- **Running a second model / a download during a benchmark.** Bandwidth contention invalidates the timing and can OOM the box.
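
To make the "trusting one run" point concrete, the sketch below summarizes repeated runs of a cell and applies a crude check: if the score ranges of two configs overlap, the difference may be run-to-run noise. With N=3 this is a rule of thumb rather than a statistical test, and the helper names are hypothetical.

```python
from statistics import median


def summarize(passes: list) -> dict:
    """Min / median / max pass count over the repeats of one cell."""
    return {
        "min": min(passes),
        "median": median(passes),
        "max": max(passes),
        "n": len(passes),
    }


def ranges_overlap(a: list, b: list) -> bool:
    """True if the [min, max] ranges of two cells overlap.

    Overlap means the observed difference could plausibly be variance.
    """
    return min(a) <= max(b) and min(b) <= max(a)


if __name__ == "__main__":
    # Hypothetical pass counts for two cells of the same model.
    baseline, variant = [8, 6, 7], [7, 7, 8]
    print(summarize(baseline), summarize(variant))
    print("could be noise:", ranges_overlap(baseline, variant))
```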

## Cost/time

A per-model heavy matrix (a few cells × N=3) is ~1-5h on a single bandwidth-bound box at $0 marginal cost — fast models (3B-active) finish in ~1h, large-active models in ~5h. Budget accordingly and queue serially.
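
For queue planning, the wall-clock estimate is plain multiplication. In the sketch below every input is a placeholder; in particular `minutes_per_task` is an assumed figure that you would measure from the smoke run or the first cell, not a number reported here.

```python
def matrix_hours(cells: int, repeats: int, tasks: int,
                 minutes_per_task: float) -> float:
    """Estimated wall-clock hours for one model's matrix, run serially."""
    return cells * repeats * tasks * minutes_per_task / 60.0


if __name__ == "__main__":
    # Hypothetical: 4 cells x N=3 x 9 tasks at 1 minute per task.
    print(f"~{matrix_hours(4, 3, 9, 1.0):.1f} h")
```

Summing this estimate over the shortlist gives the total serial queue time, since only one model can be resident at a time.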

