---
title: "Cost-Per-Pass, Not Cost-Per-Call: The Right Metric for Autonomous Agent Routing"
description: "Per-token rates and per-call costs are misleading metrics for autonomous agents. Use cost-per-pass = cost-per-call divided by pass rate. Rankings invert; the cheapest-looking model can be the most expensive."
url: https://agent-zone.ai/knowledge/agent-tooling/cost-per-pass-not-cost-per-call/
section: knowledge
date: 2026-05-20
categories: ["agent-tooling"]
tags: ["cost-optimization","model-selection","routing","deepseek","kimi","grok","sonnet","economics"]
skills: ["cost-modeling","model-selection","fleet-economics"]
tools: ["anthropic","moonshot","deepseek","xai"]
levels: ["intermediate","advanced"]
word_count: 1307
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/cost-per-pass-not-cost-per-call/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/cost-per-pass-not-cost-per-call/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Cost-Per-Pass%2C+Not+Cost-Per-Call%3A+The+Right+Metric+for+Autonomous+Agent+Routing
---


# Cost-Per-Pass, Not Cost-Per-Call

Practitioners price LLMs by the per-token rate on the provider's pricing page. For autonomous agents, that number is misleading. Two layers of indirection sit between the per-token rate and the cost you actually pay to get work done: variable prompt sizes turn per-token into per-call, and variable pass rates turn per-call into per-pass. Each layer can invert the ranking.

For autonomous fleets where failed attempts trigger reviewer cycles, retries, and reputational drag, **cost-per-pass is the only metric that ranks models correctly**. This article shows how to compute it, when it dominates, and where the cheapest-per-token model becomes the most expensive in production.

## TL;DR for agents

- If you are routing autonomous work, rank by `$/pass = $/call ÷ pass_rate`, not by `$/M-tokens` or `$/call`.
- If a model has 25% pass rate, its real cost is 4× the per-call cost. A model at $0.16/call with 25% pass costs $0.64/pass — more than a 100%-pass model at $0.17.
- If you cannot retry (single-shot decision, no reviewer), `$/call` is the right metric. Otherwise use `$/pass`.
- For high-value specs you cannot afford to retry, pay the premium tier. Sonnet at $7.73/pass beats grok-4.3 at $0.63/pass on any spec where the architect cares about getting it done first try.
- Recompute `$/pass` whenever your canary shape changes. Heavy multi-file and tier-2 single-file have completely different rankings.

## The Data — Same Canary, Different Rankings

All numbers below are from `canaries/tier-3/197e215e-record-cost-on-send-error` — a heavy multi-file refactor across 5 providers' error paths. N=2–4 per cell (small sample; treat as direction, not precision).

| Model | Pass rate | $/call | $/pass | Per-token rate |
|---|---|---|---|---|
| deepseek-v4-flash + d4-rich | 100% (3/3) | $0.04 | **$0.04** | $0.28/M in, $1.10/M out |
| deepseek-v4-pro (75% discount) | 100% (3/3) | $0.17 | $0.17 | $0.44/M in, $0.87/M out |
| anthropic/sonnet-4-6 | 100% (2/2) | $7.73 | $7.73 | $3.00/M in, $15.00/M out |
| moonshot/kimi-k2.6 baseline | 67% (2/3) | $1.61 | $2.42 | $0.95/M in, $4.00/M out |
| xai/grok-4.3 (tuned prompt) | 25% (1/4) | $0.16 | $0.63 | $0.20/M in, $1.50/M out |
| xai/grok-4.20-reasoning | 0% (0/2) | timed out | ∞ | n/a |

The ranking by input per-token rate is grok-4.3 < deepseek-flash < deepseek-pro < kimi < sonnet. The ranking by $/pass is deepseek-flash < deepseek-pro < grok-4.3 < kimi < sonnet, with grok-4.20-reasoning ejected entirely because it never finishes.

Look at the kimi row: a 67% pass rate turns a $1.61 call into a $2.42 pass. Look at the grok-4.3 row: a 25% pass rate turns a $0.16 call into a $0.63 pass. That is still cheap, but the gap to deepseek-flash widens from 4× to 16×.
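As a rough check on the rankings above, the sketch below sorts the table rows three ways. It uses the rounded `$/call` figures from the table, so grok-4.3 comes out at $0.64/pass rather than $0.63; the tuple layout and the `rank` helper are illustrative and not part of any harness.

```python
# Rows copied from the table above: (model, pass_rate, $/call, $/M input tokens).
# grok-4.20-reasoning is omitted because it has no $/call (it timed out).
ROWS = [
    ("deepseek-v4-flash + d4-rich", 3 / 3, 0.04, 0.28),
    ("deepseek-v4-pro", 3 / 3, 0.17, 0.44),
    ("anthropic/sonnet-4-6", 2 / 2, 7.73, 3.00),
    ("moonshot/kimi-k2.6", 2 / 3, 1.61, 0.95),
    ("xai/grok-4.3", 1 / 4, 0.16, 0.20),
]


def rank(rows, key):
    """Return model names sorted cheapest-first by the given key function."""
    return [row[0] for row in sorted(rows, key=key)]


by_input_rate = rank(ROWS, key=lambda r: r[3])
by_call = rank(ROWS, key=lambda r: r[2])
by_pass = rank(ROWS, key=lambda r: r[2] / r[1])  # $/call divided by pass rate

for label, order in [("$/M in", by_input_rate), ("$/call", by_call), ("$/pass", by_pass)]:
    print(f"{label:>7}: " + " < ".join(order))
```

grok-4.3 leads on input rate, sits second on `$/call`, and drops to third on `$/pass`, which is the inversion the table describes.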

## The Math

Under independent retries, the number of attempts needed to get one pass follows a geometric distribution. For pass rate `p`, its expectation is:

```
expected_attempts = 1 / p
expected_cost     = $/call × expected_attempts = $/call ÷ p = $/pass
```

This is a floor. It assumes independent attempts (no learning), failed attempts that cost about the same as successful ones (true within ~20% in our data), and binary pass/fail acceptance. Real fleet cost is higher because failed attempts also burn reviewer cycles, accumulate hub_events that confuse PM dispatch, and inflate latency for downstream consumers.
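A minimal Python sketch of the formula, paired with a small Monte Carlo check of the geometric expectation. Both function names are illustrative, and the simulation bakes in the same assumptions as the formula: independent attempts, and every call costing the same whether it passes or fails.

```python
import math
import random


def cost_per_pass(cost_per_call: float, pass_rate: float) -> float:
    """Geometric-expectation floor: $/call divided by pass rate."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    if pass_rate == 0.0:
        return math.inf  # never passes, so no finite spend buys a pass
    return cost_per_call / pass_rate


def simulated_cost_per_pass(cost_per_call, pass_rate, specs=100_000, seed=0):
    """Retry each spec until it passes and return the average spend per spec."""
    if pass_rate <= 0.0:
        raise ValueError("simulation needs pass_rate > 0 to terminate")
    rng = random.Random(seed)
    total_calls = 0
    for _ in range(specs):
        total_calls += 1                  # first attempt
        while rng.random() >= pass_rate:  # that attempt failed, so retry
            total_calls += 1
    return cost_per_call * total_calls / specs
```

For the grok-4.3 row, `cost_per_pass(0.16, 0.25)` gives 0.64, and the simulated value should land close to it for a large `specs`.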

## 100-Dispatch Projection

To make the difference concrete, project 100 heavy specs through each model. Retry until each spec passes:

| Model | First-pass yield | Retries needed | Cumulative cost |
|---|---|---|---|
| deepseek-v4-flash + d4-rich | 100% | 0 | **$4** |
| deepseek-v4-pro (discount) | 100% | 0 | $17 |
| grok-4.3 (tuned) | 25% | ~300 (geometric expectation) | $63 |
| kimi-k2.6 baseline | 67% | ~50 | $242 |
| anthropic/sonnet-4-6 | 100% | 0 | $773 |
| grok-4.20-reasoning | 0% | unbounded | ∞ |

The grok-4.3 row is the most instructive: a low per-call cost ($0.16) makes it look like an obvious cheap choice. Multiply by the 4× retry tax and it costs more per pass than deepseek-pro. The roughly 200× gap between flash and sonnet at an identical 100% pass rate is the prize for routing by `$/pass` correctly.
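The projection can be reproduced with a few lines of Python. This is a sketch under the same geometric assumption; the `project` helper is hypothetical. Because it starts from the rounded `$/call` values, grok-4.3 comes out at $64 rather than the table's $63, which presumably reflects unrounded call costs.

```python
from fractions import Fraction


def project(dispatches: int, cost_per_call: float, passed: int, attempts: int):
    """Expected (retries, cumulative cost) when every spec is retried until it passes."""
    if passed == 0:
        return float("inf"), float("inf")  # a 0% model never clears the queue
    p = Fraction(passed, attempts)         # exact pass rate, e.g. 2/3 rather than 0.67
    expected_calls = dispatches / p        # geometric expectation: 1/p calls per spec
    retries = float(expected_calls - dispatches)
    return retries, cost_per_call * float(expected_calls)


for name, cost, passed, attempts in [
    ("deepseek-v4-flash + d4-rich", 0.04, 3, 3),
    ("deepseek-v4-pro", 0.17, 3, 3),
    ("grok-4.3", 0.16, 1, 4),
    ("kimi-k2.6", 1.61, 2, 3),
    ("sonnet-4-6", 7.73, 2, 2),
]:
    retries, total = project(100, cost, passed, attempts)
    print(f"{name:<28} retries={retries:6.1f} cost=${total:,.2f}")
```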

## When Per-Call Cost Matters More

Per-call cost is the right metric when:

- **Single-shot decisions where a wrong answer is acceptable to ship**: low-stakes classification, draft summaries a human edits anyway, opportunistic suggestions.
- **Human-reviewer-at-the-end workloads where partial output is useful**: review comments where "I'm 60% sure this is a bug" is signal even when wrong.
- **Streaming generation where the user can interrupt**: chat assistance bounded by user attention.

Most autonomous agent work is not in this category. Builder pools, reviewer pools, and PM dispatch all recover from per-call failure with another call. `$/pass` is the metric.

## The Expensive-But-Perfect Tier

Sonnet at $7.73/pass looks indefensible next to deepseek-flash at $0.04. Why use it?

Because cheaper-tier pass rates are conditional on **spec clarity you don't always have**. The flash 100% rate is on a concrete, file-listed, binary-acceptance spec. On a fuzzy spec where the architect is mid-design, flash's rate drops sharply (~40% in production observation; not in the matrix data).

Sonnet absorbs ambiguity and pushes back where flash defers or ships the wrong interpretation. For the 5–10% of specs where the architect can't clarify upfront, re-dispatching costs more than the model premium, and the downstream blocker is a release timeline — sonnet is the right target. Pay 200× to avoid the fuzzy-spec retry tax.

The mistake is using sonnet for all specs. At 100 dispatches/week, that's $773/week, or about $40K/year, for one builder. Reserve it for specs that need it.
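One way to make the trade-off concrete is to ask how expensive a failed attempt has to be, in reviewer time and delay, before sonnet wins on a fuzzy spec. The sketch below is an illustration under loud assumptions: a flat dollar overhead per failed attempt (a hypothetical parameter, not measured in the matrix data), flash at the ~40% fuzzy-spec pass rate quoted above, and sonnet still passing first try on that spec.

```python
def effective_cost_per_pass(cost_per_call, pass_rate, overhead_per_failure=0.0):
    """$/pass plus an assumed flat overhead for each expected failed attempt."""
    expected_attempts = 1.0 / pass_rate
    expected_failures = expected_attempts - 1.0
    return cost_per_call * expected_attempts + overhead_per_failure * expected_failures


def break_even_overhead(cheap_call, cheap_pass_rate, premium_cost_per_pass):
    """Overhead per failed attempt at which the cheap model matches the premium one."""
    expected_attempts = 1.0 / cheap_pass_rate
    expected_failures = expected_attempts - 1.0
    if expected_failures == 0:
        return float("inf")  # the cheap model never fails, so overhead never applies
    return (premium_cost_per_pass - cheap_call * expected_attempts) / expected_failures


# flash at $0.04/call and ~40% pass on a fuzzy spec, versus sonnet at $7.73/pass.
threshold = break_even_overhead(0.04, 0.40, 7.73)
print(f"sonnet wins once a failed attempt costs more than ${threshold:.2f}")
```

Under these assumptions the break-even is about $5.09 per failed attempt. On model spend alone flash stays cheaper even at 40%, so the case for sonnet rests on the non-model cost of each failure.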

## Canary-Shape Sensitivity

Rankings flip by canary shape. The table above ranks for heavy multi-file. For tier-2 single-file bug fixes, kimi-k2.6 passes 3/3 on every variant at sub-dollar cost: `thinking-off` wins by `$/call` at $0.034 (saving 27% reasoning overhead). That same `thinking-off` variant drops to 0/3 on the tier-3 canary above.

**One model, two canaries, two different "best" configs.** Recompute `$/pass` whenever your canary shape changes.

## How to Build a $/Pass Table for Your Fleet

1. **Pick a representative canary per work tier.** Tier-3 = heavy multi-file, tier-2 = single-file or REQUIRED-FIX. Binary acceptance — no "the model tried hard" verdicts.
2. **Run N≥3 per (model, canary).** Aggregate pass rate from per-task verdict files, not summary CSVs (most harnesses log only the first canary's verdict).
3. **Compute `$/call` from real provider rates.** Adapter rate cards drift; verify against the provider's billing dashboard. A fleet that fell back to Sonnet rates for unknown models over-billed kimi-k2.6 by 5×.
4. **Divide.** That table is the routing source-of-truth.
5. **Re-run on model upgrade or pricing change.** The deepseek V4-Pro 75% discount expires 2026-05-31 — every $/pass row that uses V4-Pro will be wrong on June 1.
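The steps above can be sketched as a small aggregation over per-run verdicts. The `Verdict` record and its field names are hypothetical, not a real harness schema; the sketch assumes each run's cost has already been checked against provider billing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One canary run. Field names are hypothetical, not a real harness schema."""
    model: str
    canary: str
    passed: bool      # binary acceptance only
    cost_usd: float   # cost of this call, verified against the provider


def build_table(verdicts, min_runs=3):
    """Aggregate runs into {(model, canary): row} with pass rate, $/call, $/pass."""
    groups = defaultdict(list)
    for v in verdicts:
        groups[(v.model, v.canary)].append(v)

    table = {}
    for key, runs in groups.items():
        n = len(runs)
        passes = sum(r.passed for r in runs)
        cost_per_call = sum(r.cost_usd for r in runs) / n
        table[key] = {
            "n": n,
            "pass_rate": passes / n,
            "cost_per_call": cost_per_call,
            # Divide by pass rate; a cell with zero passes has no finite $/pass.
            "cost_per_pass": cost_per_call * n / passes if passes else float("inf"),
            "low_sample": n < min_runs,  # flag cells below the N>=3 guideline
        }
    return table
```

Keying rows by `(model, canary)` keeps tier-2 and tier-3 results apart, so a model's pass rate on one shape is never reused for another.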

## Common Mistakes

**Comparing per-token rates across providers without canary data.** A $0.20/M input model can cost more per pass than a $3.00/M model if its pass rate is low enough.

**Picking the cheapest model in a pool config without measuring pass rate.** The most common production failure. Architect picks a model from the pricing page; nobody runs a canary; the pool ships at 30% pass and the team notices a week later when REQUIRED-FIX backlog dominates.

**Ignoring that pass rate is canary-shape-dependent.** Kimi at 100% on tier-2 and 67% on tier-3 are both real. Routing all kimi work as if it passed 100% of the time overcommits it to tier-3 work it can't finish.

**Conflating `$/call` from your tracker with real billing.** If your adapter has a rate-card fallback to a different model's rates, the tracker is fiction. Audit against the provider dashboard before trusting the routing table.
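A minimal audit sketch for this check: recompute what a call should cost from its token counts and the provider's real rates, then compare against the tracker. The function names, the 10% tolerance, the token counts, and the tracked figure are all made up for illustration; only the kimi-k2.6 rates come from the table above.

```python
def call_cost(input_tokens: int, output_tokens: int, rate_in: float, rate_out: float) -> float:
    """Cost of one call in dollars, given per-million-token rates."""
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000


def audit(tracked_cost: float, input_tokens: int, output_tokens: int,
          rate_in: float, rate_out: float, tolerance: float = 0.10):
    """Return (expected_cost, ratio, ok); a ratio above 1 means the tracker over-bills."""
    expected = call_cost(input_tokens, output_tokens, rate_in, rate_out)
    ratio = tracked_cost / expected
    return expected, ratio, abs(ratio - 1.0) <= tolerance


# Hypothetical call: 1M input and 100K output tokens at kimi-k2.6 rates,
# with a tracker that recorded $6.75 for it.
expected, ratio, ok = audit(6.75, 1_000_000, 100_000, rate_in=0.95, rate_out=4.00)
print(f"expected ${expected:.2f}, tracker is {ratio:.1f}x, ok={ok}")
```

Running this across a sample of real calls per model is a cheap way to catch a rate-card fallback before it skews the routing table.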

**Forgetting retries aren't free.** Geometric-expectation `$/pass` is a floor. Real fleet cost adds reviewer cycles, PM dispatch overhead, and branch-pollution recovery, which puts a multiplier of 1.5–2× on high-failure-rate models.

## Putting It All Together

```
1. Pick a canary matching the spec's tier and shape.
2. Look up $/pass for each pool's model on that canary tier.
3. Choose the cheapest $/pass that meets urgency and spec-clarity.
4. If the spec is fuzzy, escalate one tier.
5. If the spec is concrete, take the cheapest $/pass model with pass rate ≥ 90%.
```
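The procedure above might look like this in code. It is a sketch, not the fleet's actual router: the `tier` field on each row is a hypothetical integer (0 = cheapest pool) introduced only to give "escalate one tier" a concrete meaning, and urgency is left out entirely.

```python
import math


def route(table, canary, fuzzy, min_pass_rate=0.90):
    """Pick a model from {(model, canary): row}; rows carry pass_rate, cost_per_pass, tier."""
    eligible = [
        (row["cost_per_pass"], row["tier"], model)
        for (model, c), row in table.items()
        if c == canary
        and row["pass_rate"] >= min_pass_rate
        and math.isfinite(row["cost_per_pass"])
    ]
    if not eligible:
        raise LookupError(f"no model reaches {min_pass_rate:.0%} on {canary}")

    cost, tier, model = min(eligible)  # cheapest $/pass among qualifying models
    if fuzzy:
        # Escalate one tier: cheapest qualifying model in the next tier up, if any.
        higher_tiers = [e[1] for e in eligible if e[1] > tier]
        if higher_tiers:
            next_tier = min(higher_tiers)
            cost, tier, model = min(e for e in eligible if e[1] == next_tier)
    return model


# Tier numbers here are assumptions for the example, not part of the matrix data.
TABLE = {
    ("deepseek-v4-flash + d4-rich", "tier-3"): {"pass_rate": 1.0, "cost_per_pass": 0.04, "tier": 0},
    ("deepseek-v4-pro", "tier-3"): {"pass_rate": 1.0, "cost_per_pass": 0.17, "tier": 0},
    ("xai/grok-4.3", "tier-3"): {"pass_rate": 0.25, "cost_per_pass": 0.63, "tier": 0},
    ("anthropic/sonnet-4-6", "tier-3"): {"pass_rate": 1.0, "cost_per_pass": 7.73, "tier": 1},
}
```

With this table, a concrete tier-3 spec routes to deepseek-v4-flash and a fuzzy one escalates to sonnet; grok-4.3 is filtered out by the 90% pass-rate floor.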

The two biggest cost wins in our 2026-05-19 → 2026-05-20 matrix work:
- Shipping deepseek-v4-flash + d4-rich to `builder-heavy-fast` displaced sonnet for concrete heavy specs. ~$770 → ~$4 per 100 dispatches.
- Keeping kimi-k2.6 at baseline (not promoting d4-rich) avoided a 12× cost-per-call tier-2 regression.

Neither was visible from per-token rate sheets. Both came from $/pass tables built on real canary data. That is the routing discipline.

