Cost-Per-Pass, Not Cost-Per-Call: The Right Metric for Autonomous Agent Routing

Cost-Per-Pass, Not Cost-Per-Call#

Practitioners price LLMs by the per-token rate on the provider’s pricing page. For autonomous agents, that number is misleading. Two layers of indirection sit between the per-token rate and the cost you actually pay to get work done: variable prompt sizes turn per-token into per-call, and variable pass rates turn per-call into per-pass. Each layer can invert the ranking.

For autonomous fleets where failed attempts trigger reviewer cycles, retries, and reputational drag, cost-per-pass is the only metric that ranks models correctly. This article shows how to compute it, when it dominates, and where the cheapest-per-token model becomes the most expensive in production.

LLM Adapter Audit Checklist: 10 Bugs That Hide in OpenAI-Compatible Providers

LLM Adapter Audit Checklist#

When you wrap an OpenAI-compatible LLM provider (Moonshot, DeepSeek, xAI, Together, Fireworks, OpenRouter, vLLM, anything else that exposes POST /v1/chat/completions) in a Go HTTP client, the same ten bug classes show up. They all silently degrade or break the agent — none of them crash loudly. Each was observed in production across at least one of xAI, DeepSeek, or Moonshot during a two-week audit period.

This checklist is the audit. Run it against any new adapter before shipping. Each entry is Symptom → Cause → Fix with a code shape you can grep your repo for.

Reasoning-Model Tuning Asymmetry: Why Thin Prompts Beat Rich Prompts (and When They Don't)

Reasoning-Model Tuning Asymmetry#

Practitioners assume “better prompt = better output”. For one model class, that assumption is correct. For the other, the same prompt makes things measurably worse. This article documents the asymmetry, names the dividing line, and gives you a 4-cell test to confirm it on your own canary before you commit to a prompt.

The asymmetry is empirical, not theoretical. It shows up cleanly across four independent OFAT (one-factor-at-a-time) matrices run between 2026-05-18 and 2026-05-20: sonnet POC, grok matrix v1+v2, deepseek matrix v1, kimi matrix v1.

xAI Grok Operational Quirks: Error Shapes, Rate-Limit HTML, and Per-Model Tool Surfaces

xAI Grok Operational Quirks#

xAI’s Grok API is OpenAI-compatible on paper. In practice it has more wire-format edge cases than any other provider in production: error responses change shape, rate-limit pages come back as HTML, assistant turns reject missing fields with HTTP 422, and the two flagship models (grok-4.3 and grok-4.20-reasoning) have incompatible parameter sets. Wrap it carelessly and the adapter crashes the conversation mid-turn.

This page is the production-confirmed quirks list, each as Symptom → Cause → Fix → Verify. Numbers come from two OFAT matrix runs (15 cells × N=3 baseline, 3 cells × N=5 validation) on api.x.ai and the heavy-tier POC. Full synthesis: ~/.claude/projects/-Users-mstather/memory/project_xai_adapter_wireerror_bug_2026_05_19.md and project_grok_matrix_v1_2026_05_19.md.