Cost-Per-Pass, Not Cost-Per-Call: The Right Metric for Autonomous Agent Routing

Cost-Per-Pass, Not Cost-Per-Call#

Practitioners price LLMs by the per-token rate on the provider’s pricing page. For autonomous agents, that number is misleading. Two layers of indirection sit between the per-token rate and the cost you actually pay to get work done: variable prompt sizes turn per-token into per-call, and variable pass rates turn per-call into per-pass. Each layer can invert the ranking.

For autonomous fleets where failed attempts trigger reviewer cycles, retries, and reputational drag, cost-per-pass is the only metric that ranks models correctly. This article shows how to compute it, when it dominates, and where the cheapest-per-token model becomes the most expensive in production.

DeepSeek V4 Operational Quirks: Pro vs Flash, Reasoning Echo, and the Discount Cliff

DeepSeek V4 Operational Quirks#

DeepSeek V4 ships two models behind one OpenAI-compatible API: V4-Pro (reasoning) at $1.74/M input / $3.48/M output and V4-Flash (chat) at $0.28/M input / $1.10/M output. Until 2026-05-31 V4-Pro carries a 75% discount, putting it at $0.435/M input — cheap enough to use as a heavy-tier coding model. After that, the cost steps up 4×.

The two models live on the same endpoint but want very different things. V4-Pro behaves like a reasoning model (thin prompts, reasoning_content echo required, tool_choice restrictions). V4-Flash behaves like a chat model (rich prompts win dramatically; rejects nothing). Confuse them and your matrix lights up red.

Long-Term Metrics Storage: Thanos vs Grafana Mimir vs VictoriaMetrics

The Retention Problem#

Prometheus stores metrics on local disk with a default retention of 15 days. Most production teams extend this to 30 or 90 days, but local storage has hard limits. A single Prometheus instance cannot scale disk beyond the node it runs on. It provides no high availability – if the instance goes down, you lose scraping and query access. And each Prometheus instance only sees its own targets, so there is no unified view across clusters or regions.