---
title: "Benchmarking Local LLMs for Agentic Coding"
description: "A methodology for choosing local models for autonomous coding agents: why leaderboard scores mislead, the execution-stamina discriminator task, tier-based canaries, OFAT tuning at N≥3, and what the failure modes actually are (budget-exceeded, not incapacity)."
url: https://agent-zone.ai/knowledge/agent-tooling/benchmarking-local-llms-for-agentic-coding/
section: knowledge
date: 2026-05-25
categories: ["agent-tooling"]
tags: ["local-llm","benchmarking","agentic-coding","evaluation","model-selection","tool-calling","moe","harness"]
skills: ["model-evaluation","benchmark-design","model-selection","agentic-coding-assessment"]
tools: ["ollama","lm-studio"]
levels: ["advanced"]
word_count: 688
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/benchmarking-local-llms-for-agentic-coding/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/benchmarking-local-llms-for-agentic-coding/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Benchmarking+Local+LLMs+for+Agentic+Coding
---


> **Decision-first:** Evaluate on the **agent loop** (read/edit/test/push), not one-shot patches. Use a **multi-file execution-stamina** task as your discriminator, tune **OFAT at N≥3**, and distinguish turn-ceiling vs token-ceiling vs capability-ceiling — only the last is unfixable by config.

> **Scope & freshness:** Methodology is durable; the named results are 2026-05 snapshots — re-run the harness for current models.

## Why public leaderboard scores mislead

SWE-bench-style and chat leaderboards measure something adjacent to, but not the same as, **autonomous tool-using coding**. A model can score well on one-shot patch generation and still fail as an agent because the agent loop demands sustained, multi-turn behavior: read files, edit several, run tests, react to failures, and *push* — without giving up, looping, or declaring "done" early. Evaluate on the loop you'll actually run.

## Build a harness around the real loop

Drive each candidate through a fixed set of **canary tasks** with the same tools your agents use (file read/write/edit, test runner, git/PR, escalate/defer). Capture per run:

- **task-pass / N** — did it satisfy the spec's acceptance criteria?
- **discriminator pass-rate** — see below.
- **mean turns-to-complete** — efficiency.
- **malformed-tool-call count** — format reliability.
- **budget-exceeded count** — the dominant local-model failure mode.
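
The per-run metrics above can be captured in a small record and rolled up per model and config. This is a minimal sketch, not the harness itself: the `RunRecord` schema and the `summarize` helper are hypothetical names, and it assumes mean turns is taken over passing runs only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """One harness run of one model on one canary task (hypothetical schema)."""
    task_id: str
    is_discriminator: bool
    passed: bool
    turns: int
    malformed_tool_calls: int
    budget_exceeded: bool


def summarize(runs: list[RunRecord]) -> dict:
    """Aggregate the per-run metrics for a single model/config."""
    if not runs:
        raise ValueError("no runs to summarize")
    disc = [r for r in runs if r.is_discriminator]
    completed = [r.turns for r in runs if r.passed]
    return {
        "task_pass": f"{sum(r.passed for r in runs)}/{len(runs)}",
        "discriminator_pass_rate": (
            sum(r.passed for r in disc) / len(disc) if disc else None
        ),
        # Assumption: failed runs never "completed", so they are excluded here.
        "mean_turns_to_complete": mean(completed) if completed else None,
        "malformed_tool_calls": sum(r.malformed_tool_calls for r in runs),
        "budget_exceeded": sum(r.budget_exceeded for r in runs),
    }
```

Keeping `budget_exceeded` as its own field, rather than folding it into a generic failure flag, is what lets you separate budget failures from the other kinds later.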

## The discriminator: an execution-stamina task

The most informative single canary is a **fully specified, multi-file spec** (e.g., "make the same change across all 5 providers"). It isn't a reasoning puzzle: the spec leaves nothing ambiguous. It measures whether the model can *sustain* a long, mechanical, multi-file edit to completion.

Weak local models usually fail it not through incapacity but by **exceeding their budget**: they waste turns and tokens on malformed or redundant calls before finishing. A model that repeatedly lands "4 of 5 files" is usually hitting a stamina/budget ceiling, not a capability wall. This reframes tuning: for these models, levers that *cut wasted turns* (constrained tool calls, completeness directives, right-sized budgets) matter more than raw capability.
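
A completeness directive can be backed by a mechanical check in the harness. The sketch below assumes the spec lists the files it requires changed and that the harness knows which files the agent edited. Both function names and the message wording are hypothetical.

```python
from typing import Optional


def missing_targets(required: set[str], edited: set[str]) -> list[str]:
    """Return spec-required files the agent has not edited yet, sorted."""
    return sorted(required - edited)


def completeness_message(required: set[str], edited: set[str]) -> Optional[str]:
    """Build a pre-push reminder, or None when every required file was edited."""
    missing = missing_targets(required, edited)
    if not missing:
        return None
    done = len(required) - len(missing)
    return (
        f"{done} of {len(required)} required files edited. "
        f"Still missing: {', '.join(missing)}. Finish these before pushing."
    )
```

Feeding the message back to the model before it pushes is one way to turn a "4 of 5 files" run into a completed one without raising the budget.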

## Tier the canaries

Structure tasks by difficulty so you can place a model:

- **lite** — short, single-file, simple edits.
- **medium** — a few files, moderate logic, tests.
- **heavy** — multi-file refactors, the execution-stamina discriminator.

Run a candidate against the tier you're hiring for, plus one tier up as a ceiling check.
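
The "target tier plus one up" rule is simple enough to encode directly. A minimal sketch, with `tiers_to_run` as a hypothetical helper name:

```python
TIERS = ["lite", "medium", "heavy"]  # ordered easiest to hardest


def tiers_to_run(target: str) -> list[str]:
    """Return the target tier plus the next tier up, if one exists."""
    i = TIERS.index(target)  # raises ValueError for an unknown tier
    return TIERS[i : i + 2]
```

For a heavy-tier target there is no tier above it, so only `heavy` is returned.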

## OFAT tuning protocol

1. Establish a **per-model baseline** config.
2. Change **one factor at a time** (temperature, reasoning, echo, turn/token budget, a prompt directive).
3. Confirm any apparent winner at **N≥3**: a single run can mislead, and even temp-0 isn't fully deterministic on llama.cpp (batching and floating-point effects).
4. Finish with an **"all-on" combined** cell of the per-factor winners.
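
The protocol above can be sketched as a small cell generator. The factor names and values in the usage note are illustrative, and the winner rule (at least three runs on both sides and a higher mean score) is an assumption rather than a rule stated by this methodology.

```python
from statistics import mean


def ofat_cells(baseline: dict, factors: dict[str, list]) -> list[tuple[str, dict]]:
    """Return (label, config) cells: the baseline plus one cell per single-factor change."""
    cells = [("baseline", dict(baseline))]
    for name, values in factors.items():
        for value in values:
            if value == baseline[name]:
                continue  # identical to baseline, nothing to test
            cells.append((f"{name}={value}", {**baseline, name: value}))
    return cells


def is_confirmed_winner(
    candidate: list[int], baseline: list[int], min_runs: int = 3
) -> bool:
    """Assumed rule: need >= min_runs on both sides and a higher mean score."""
    if len(candidate) < min_runs or len(baseline) < min_runs:
        return False
    return mean(candidate) > mean(baseline)


def all_on_cell(baseline: dict, winners: dict) -> dict:
    """Combine the confirmed per-factor winners into one final config."""
    return {**baseline, **winners}
```

For example, `ofat_cells({"temperature": 0.0, "max_turns": 40}, {"temperature": [0.0, 0.7], "max_turns": [40, 80]})` yields three cells: the baseline, one with only the temperature changed, and one with only the turn budget changed.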

## What the findings tend to look like

From one such effort (illustrative of the shape, not absolutes):

- **Heavy tier is achievable locally at $0.** A ~12B-active MoE reached a perfect heavy-tier score once given a turn budget large enough to finish *and* a pre-push completeness directive (the fix was the directive, not more capability).
- **The "small-active stamina cliff" is real but softer than expected.** A 3B-active MoE cleared most heavy tasks but capped at ~33% on the hardest multi-file spec — and crucially, **neither more turns nor more tokens fixed it** (it's a capability/variance ceiling, confirmed by sweeping both budgets).
- **Some models are dead-ends for tool use** regardless of tuning — e.g., a model whose tool-call format the runtime can't parse emits **zero** tool calls; no prompt fixes it.

## What didn't work (so you don't repeat it)

- **Trusting public leaderboard/chat scores** to predict agentic tool-use — a model can patch well one-shot and still emit 0 tool calls as an agent.
- **A single run per config** — variance flipped an 8/9 to 6/9; N≥3 is mandatory (temp-0 isn't deterministic on llama.cpp).
- **Treating every `budget_exceeded` as "needs more turns"** — sweep both turn and token budgets; if it still caps, it's a capability ceiling and more budget is wasted.
- **Sequential POCs for a concurrent deployment** — a model that passed loaded-one-at-a-time failed in production where two models had to co-reside.
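
The budget-sweep point above can be turned into a rough classifier. This sketch assumes each pass rate is averaged over at least three runs. The `min_gain` threshold for separating a real improvement from run-to-run noise is a hypothetical value you would need to calibrate against your own variance.

```python
def classify_ceiling(
    base_rate: float,
    more_turns_rate: float,
    more_tokens_rate: float,
    min_gain: float = 0.15,
) -> list[str]:
    """Label which ceiling a budget sweep points to.

    Rates are pass rates in the range 0..1. `min_gain` is an assumed
    threshold for calling a change real rather than noise.
    """
    ceilings = []
    if more_turns_rate - base_rate >= min_gain:
        ceilings.append("turn")
    if more_tokens_rate - base_rate >= min_gain:
        ceilings.append("token")
    # Neither budget helped: treat it as a capability ceiling.
    return ceilings or ["capability"]
```

A result of `["capability"]` is the signal to stop spending budget on that model for that tier.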

## Takeaways

- Benchmark the **agent loop**, not one-shot patches.
- The **multi-file execution-stamina** task is your best discriminator; its failures are usually budget, not brains.
- **OFAT at N≥3**, then combine winners.
- Distinguish **turn ceiling vs token ceiling vs capability ceiling** — sweeping both budgets tells you which, and only a capability ceiling is unfixable by config.

