Cost-Per-Pass, Not Cost-Per-Call: The Right Metric for Autonomous Agent Routing

Cost-Per-Pass, Not Cost-Per-Call#

Practitioners price LLMs by the per-token rate on the provider’s pricing page. For autonomous agents, that number is misleading. Two layers of indirection sit between the per-token rate and the cost you actually pay to get work done: variable prompt sizes turn per-token into per-call, and variable pass rates turn per-call into per-pass. Each layer can invert the ranking.

For autonomous fleets where failed attempts trigger reviewer cycles, retries, and reputational drag, cost-per-pass is the only metric that ranks models correctly. This article shows how to compute it, when it dominates, and where the cheapest-per-token model becomes the most expensive in production.

Reasoning-Model Tuning Asymmetry: Why Thin Prompts Beat Rich Prompts (and When They Don't)

Reasoning-Model Tuning Asymmetry#

Practitioners assume “better prompt = better output”. For one model class, that assumption is correct. For the other, the same prompt makes things measurably worse. This article documents the asymmetry, names the dividing line, and gives you a 4-cell test to confirm it on your own canary before you commit to a prompt.

The asymmetry is empirical, not theoretical. It shows up cleanly across four independent OFAT (one-factor-at-a-time) matrices run between 2026-05-18 and 2026-05-20: sonnet POC, grok matrix v1+v2, deepseek matrix v1, kimi matrix v1.

The d4-rich Prompt Pattern: Unlocking Non-Reasoning Models on Multi-File Tasks

The d4-rich Prompt Pattern#

Non-reasoning chat models (deepseek-V4-Flash, grok-4.3, kimi with thinking disabled) collapse on multi-file refactor tasks when given thin or baseline prompts. Pass rates of 0-33% on canaries that reasoning models clear at 67-100%. The cheap fix is a three-part prompt addendum: completion checklist, callsites-exhaustively-updated rule, and verify-before-push instruction. Drop it into the system prompt of a non-reasoning model and the canaries go green. Drop it into a reasoning model and you pay 12× more for 0% quality improvement.

xAI Grok Operational Quirks: Error Shapes, Rate-Limit HTML, and Per-Model Tool Surfaces

xAI Grok Operational Quirks#

xAI’s Grok API is OpenAI-compatible on paper. In practice it has more wire-format edge cases than any other provider in production: error responses change shape, rate-limit pages come back as HTML, assistant turns reject missing fields with HTTP 422, and the two flagship models (grok-4.3 and grok-4.20-reasoning) have incompatible parameter sets. Wrap it carelessly and the adapter crashes the conversation mid-turn.

This page is the production-confirmed quirks list, each as Symptom → Cause → Fix → Verify. Numbers come from two OFAT matrix runs (15 cells × N=3 baseline, 3 cells × N=5 validation) on api.x.ai and the heavy-tier POC. Full synthesis: ~/.claude/projects/-Users-mstather/memory/project_xai_adapter_wireerror_bug_2026_05_19.md and project_grok_matrix_v1_2026_05_19.md.