OFAT Matrix LLM Tuning: A Methodology for Picking Sampling Params, Tool Configs, and Prompts Without Guessing

When a new provider or model lands, you have to decide what temperature, max_tokens, tool_choice, prompt shape, and turn budget to ship in production, and the default is to pick by hunch: read the model card, copy a partner adapter’s defaults, ship. A week later you find out that reasoning_effort=high doubled cost for no quality gain, max_tokens=2048 silently truncated half your tier-3 runs, and the “prompt-rich” pattern you copied from grok-4.3 actively hurts kimi.
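
To make the alternative to guessing concrete, here is a minimal sketch of how a one-factor-at-a-time (OFAT) run matrix could be built: start from a baseline config and vary one knob per run while holding the rest fixed. The function name `ofat_matrix`, the baseline values, and the candidate levels are all illustrative assumptions, not settings taken from any real adapter.

```python
from typing import Any


def ofat_matrix(
    baseline: dict[str, Any], levels: dict[str, list[Any]]
) -> list[dict[str, Any]]:
    """Return the baseline run plus one run per (knob, alternative value)."""
    runs = [dict(baseline)]
    for knob, values in levels.items():
        for value in values:
            if value == baseline[knob]:
                continue  # this level is already covered by the baseline run
            run = dict(baseline)
            run[knob] = value  # change exactly one factor
            runs.append(run)
    return runs


# Hypothetical baseline and levels, chosen only to show the shape of the matrix.
baseline = {"temperature": 0.2, "max_tokens": 2048, "reasoning_effort": "medium"}
levels = {
    "temperature": [0.0, 0.2, 0.7],
    "max_tokens": [2048, 8192],
    "reasoning_effort": ["low", "medium", "high"],
}

runs = ofat_matrix(baseline, levels)
full_grid_size = 3 * 2 * 3  # what an exhaustive sweep of the same levels would cost
```

With these levels the matrix has 6 runs against 18 for the full grid. The trade-off is that OFAT cannot see interactions between knobs, so a knob that only matters in combination with another will look inert.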

The Five-Agent Research Pattern: Surveying a New LLM Provider Before You Tune It

Adopting a new LLM provider for a coding-agent role looks easy from the docs: read the model card, copy the partner adapter’s defaults, ship. A week later you find out that the provider rejects tool_choice=required in thinking mode, the docs lied about reasoning_content echoing, and your retry loop triples the per-turn timeout because the rate-limit response isn’t JSON.

The docs miss what was patched after release. The community catches what the docs miss. Partner adapters encode lived defaults nobody published. Your own adapter has bugs you can’t see from inside it. Reading any one of these in isolation gets you to “I think I understand this provider.” Reading all five in parallel gets you a knob list, an open-contradictions list, and a list of bugs to fix before the matrix runs. The pattern: spawn five parallel research sub-agents, one per angle, then synthesize their findings.
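
The fan-out-then-synthesize shape can be sketched with asyncio. This is a sketch under stated assumptions, not the actual sub-agent harness: `research_angle` is a hypothetical stub standing in for a real sub-agent call, and the report fields simply mirror the three outputs named above. The angle labels come from the paragraph above; the list is meant to be extended to cover every angle you survey.

```python
import asyncio

# Angle labels taken from the text; extend this list as needed.
ANGLES = ["official docs", "community reports", "partner adapters", "own adapter"]


async def research_angle(angle: str) -> dict[str, list[str]]:
    """Hypothetical stub: a real version would run one research sub-agent."""
    await asyncio.sleep(0)  # placeholder for the actual agent call
    return {
        "knobs": [f"placeholder knob from {angle}"],
        "contradictions": [],
        "bugs": [],
    }


def synthesize(reports: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Merge per-angle reports into one knob list, contradictions list, bug list."""
    merged: dict[str, list[str]] = {"knobs": [], "contradictions": [], "bugs": []}
    for report in reports:
        for key in merged:
            merged[key].extend(report[key])
    return merged


async def survey(angles: list[str]) -> dict[str, list[str]]:
    # Run every angle concurrently, then combine the results in input order.
    reports = await asyncio.gather(*(research_angle(a) for a in angles))
    return synthesize(list(reports))


summary = asyncio.run(survey(ANGLES))
```

Running the angles concurrently keeps them from anchoring on each other's conclusions; contradictions only surface at the synthesis step, which is where the open-contradictions list gets built.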