Cost-Per-Pass, Not Cost-Per-Call: The Right Metric for Autonomous Agent Routing

Cost-Per-Pass, Not Cost-Per-Call#

Practitioners price LLMs by the per-token rate on the provider’s pricing page. For autonomous agents, that number is misleading. Two layers of indirection sit between the per-token rate and the cost you actually pay to get work done: variable prompt sizes turn per-token into per-call, and variable pass rates turn per-call into per-pass. Each layer can invert the ranking.

For autonomous fleets where failed attempts trigger reviewer cycles, retries, and reputational drag, cost-per-pass is the only metric that ranks models correctly. This article shows how to compute it, when it dominates, and where the cheapest-per-token model becomes the most expensive in production.

An End-to-End Workflow for Evaluating & Tuning Local LLMs for Agents

Decision-first: Follow this order and you’ll have a deployable model + tuned config in days, not weeks: (1) scope the hardware, (2) shortlist by active params, (3) per-model OFAT matrix, (4) run serially with an OOM guard (smoke first), (5) write a finding card per model, (6) decide. The expensive mistakes are skipping the smoke step, sweeping more than one factor at once, and trusting a single run.

Scope & freshness: Process is model/hardware-independent; the worked numbers are from a 2026-05 effort on a GB10 (128 GB) + an Apple-Silicon Mac, evaluating local MoE models vs cloud baselines for agentic coding. Re-validate the findings, not the workflow.

Benchmarking Local LLMs for Agentic Coding

Decision-first: Evaluate on the agent loop (read/edit/test/push), not one-shot patches. Use a multi-file execution-stamina task as your discriminator, tune OFAT at N≥3, and distinguish turn-ceiling vs token-ceiling vs capability-ceiling — only the last is unfixable by config.

Scope & freshness: Methodology is durable; the named results are 2026-05 snapshots — re-run the harness for current models.

Why public leaderboard scores mislead#

SWE-bench-style and chat leaderboards measure something adjacent to, but not the same as, autonomous tool-using coding. A model can score well on one-shot patch generation and still fail as an agent because the agent loop demands sustained, multi-turn behavior: read files, edit several, run tests, react to failures, and push — without giving up, looping, or declaring “done” early. Evaluate on the loop you’ll actually run.

Choosing a Local Model: Size Tiers, Task Matching, and Cost Comparison with Cloud APIs

Choosing a Local Model#

The most expensive mistake in local LLM adoption is running a 70B model for a task that a 3B model handles at 20x the speed for equivalent quality. The second most expensive mistake is running a 3B model on a task that requires 32B-level reasoning and getting garbage output.

Matching model size to task complexity is the core skill. This guide provides a framework grounded in empirical benchmarks, not marketing claims.