Tiered-LLM Tooling: Local Model by Default, Escalate to the Frontier Model

Tiered-LLM Tooling: Local by Default, Escalate to Frontier#

When you build a chat or ops interface backed by an LLM, paying a frontier model for every interaction is wasteful — most interactions are cheap lookups, summaries, and routing. A tiered design serves the high-frequency majority with a small local model (e.g. an Ollama-served model on a GPU you already have) and escalates to a frontier model (e.g. Claude) only for the hard minority.

Benchmarking Local LLMs for Agentic Coding

Decision-first: Evaluate on the agent loop (read/edit/test/push), not one-shot patches. Use a multi-file execution-stamina task as your discriminator, tune OFAT at N≥3, and distinguish turn-ceiling vs token-ceiling vs capability-ceiling — only the last is unfixable by config.

Scope & freshness: Methodology is durable; the named results are 2026-05 snapshots — re-run the harness for current models.

Why public leaderboard scores mislead#

SWE-bench-style and chat leaderboards measure something adjacent to, but not the same as, autonomous tool-using coding. A model can score well on one-shot patch generation and still fail as an agent because the agent loop demands sustained, multi-turn behavior: read files, edit several, run tests, react to failures, and push — without giving up, looping, or declaring “done” early. Evaluate on the loop you’ll actually run.

Tuning Local LLMs for Agentic Coding: Sampling, Reasoning, and Budgets

Decision-first: Per new model, sweep temperature (don’t assume 0.3), try reasoning off for builders, test echo_reasoning both ways, and on budget_exceeded check turns-vs-tokens before changing either. The right config is model-specific — assume nothing.

Scope & freshness: Local + cloud models for agentic coding, 2026-05. Findings are per-model (see the specific models named); treat them as examples of shape, not universal constants — re-sweep for any new model.