---
title: "OFAT Matrix LLM Tuning: A Methodology for Picking Sampling Params, Tool Configs, and Prompts Without Guessing"
description: "One-Factor-At-a-Time matrix methodology for systematically tuning an LLM in a coding agent role. Cell design recipe, canary selection, budget envelope math, and the aggregation traps that quietly invalidate matrix results."
url: https://agent-zone.ai/knowledge/agent-tooling/ofat-matrix-llm-tuning/
section: knowledge
date: 2026-05-20
categories: ["agent-tooling"]
tags: ["llm-tuning","ofat","matrix","benchmarking","evaluation","coding-agents","moonshot","deepseek","xai"]
skills: ["llm-evaluation","matrix-design","coding-agent-tuning"]
tools: ["go","bash","moonshot","deepseek"]
levels: ["intermediate","advanced"]
word_count: 1782
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/ofat-matrix-llm-tuning/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/ofat-matrix-llm-tuning/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=OFAT+Matrix+LLM+Tuning%3A+A+Methodology+for+Picking+Sampling+Params%2C+Tool+Configs%2C+and+Prompts+Without+Guessing
---


# OFAT Matrix LLM Tuning

When a new provider or model lands and you have to decide what `temperature`, `max_tokens`, `tool_choice`, prompt-shape, and turn budget to ship in production, the default is to pick by hunch. Read the model card, copy a partner adapter's defaults, ship. A week later you find out `reasoning_effort=high` doubled cost for no quality gain, `max_tokens=2048` silently truncated half your tier-3 runs, and the "prompt-rich" pattern you copied from grok-4.3 actively hurts kimi.

One-Factor-At-a-Time (OFAT) matrices fix this. Pick a baseline. Vary ONE knob per additional cell. Run each cell N times against fixed canaries. Aggregate by per-task verdict, not per-cell stdout. The whole study fits in a ~$10-30 budget and a few hours of wallclock.

## TL;DR — what an OFAT run looks like

- 5-10 cells, each varying ONE knob vs a documented baseline
- 1-2 canaries that mirror your real workload shape (NOT toy tasks)
- N=3 runs per (cell, canary) pair: the minimum that exposes run-to-run variance on a binary pass/fail outcome
- Total: 30-60 runs, $10-30 budget on cloud models
- Aggregate from per-task `verdict.json`, NOT from the harness summary CSV
- Output: a per-cell pass-rate + cost-per-pass table; ship the winner, document the losers

## Problem

You're integrating a new LLM into a coding-agent role. The provider's docs give defaults. Partner adapters (Cline, RooCode, Continue.dev) give different defaults. Community guidance gives a third set. Your existing adapter has its own legacy choices.

If you ship by hunch you will be wrong. Recent failures observed across three production matrices:

- "Higher `reasoning_effort` is always better": false for grok-4.20-reasoning (API rejects the param) and for kimi-k2.6 (no quality gain, 46% cost increase)
- "Rich/example-heavy prompts help": true for grok-4.3 (0/3 → 2/3 on heavy tier), false for kimi-k2.6 (12x cost regression on tier-2 with no quality gain), false for deepseek-reasoner, sonnet, grok-reasoning
- "More turns unlock multi-file work": false in every matrix run; budget beyond ~50 turns adds wallclock without adding capability
- "Setting `prompt_cache_key` saves 6x on input": false in the kimi matrix, which observed cache misses on every run with the key set and a 44% cost INCREASE

You can't reason your way around these from docs. You have to measure.

## The OFAT pattern

Pick a baseline cell that represents "what I'd ship if forced to ship right now." Usually this is documented defaults + the minimum overrides your adapter needs (e.g. `models.max_output_tokens: 32000` for kimi, because the 2048 default silently truncates).

For each knob you want to test, create one cell that differs from baseline by exactly that knob. Nothing else changes. The cell name encodes the knob: `kimi-thinking-off`, `grok43-reasoning-high`, `dr-echo-conditional`.

Run every cell N times against the same canaries. Compare per-cell pass-rate, cost, wallclock, defer-rate. A winner is a cell that beats baseline on at least one canary without losing on any other.
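
As a minimal sketch, the winner rule can be written as a comparison of per-canary pass counts. The function name `is_winner` and the example counts are illustrative assumptions. The sketch looks at pass counts only; cost, wallclock, and defer-rate would need the same per-canary comparison.

```python
def is_winner(cell_passes, baseline_passes):
    """Both arguments map canary name -> passes out of N (same N, same canaries)."""
    if cell_passes.keys() != baseline_passes.keys():
        raise ValueError("cell and baseline must cover the same canaries")
    better = any(cell_passes[c] > baseline_passes[c] for c in baseline_passes)
    worse = any(cell_passes[c] < baseline_passes[c] for c in baseline_passes)
    # Winner: strictly better somewhere, worse nowhere.
    return better and not worse


# Illustrative counts (not results from any matrix in this article), N=3 per canary.
baseline = {"tier-2": 2, "tier-3": 2}
improved = {"tier-2": 3, "tier-3": 2}   # better on one canary, level on the other
lopsided = {"tier-2": 3, "tier-3": 0}   # wins tier-2, loses tier-3

print(is_winner(improved, baseline))
print(is_winner(lopsided, baseline))
```

A cell that merely ties baseline on every canary is not a winner under this rule, which keeps "no change" from being mistaken for an improvement.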

## Cell design recipe

```
baseline = documented defaults + minimum-viable adapter overrides
cells = [baseline] + [baseline_with_one_change for change in candidate_knobs]
```
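
The recipe above can be sketched in Python, assuming each cell config is a nested dict and each knob is a dotted key path plus a new value. The helper names `set_path` and `build_cells`, the dotted-path convention, and the `turns-96` knob are hypothetical, chosen only for illustration.

```python
import copy


def set_path(config, dotted_key, value):
    """Set a nested key such as 'provider_overrides.max_tokens' in place."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def build_cells(baseline, candidate_knobs, prefix):
    """Return {cell_name: config}: the baseline plus one cell per single-knob change."""
    cells = {f"{prefix}-baseline": copy.deepcopy(baseline)}
    for name, (dotted_key, value) in candidate_knobs.items():
        cell = copy.deepcopy(baseline)   # every cell starts from an untouched baseline
        set_path(cell, dotted_key, value)
        cells[f"{prefix}-{name}"] = cell
    return cells


baseline = {
    "model": "moonshot/kimi-k2.6",
    "budget_turns": 48,
    "provider_overrides": {
        "thinking": {"type": "enabled", "keep": "all"},
        "max_tokens": 32000,
    },
}
candidate_knobs = {
    "max-tokens-16k": ("provider_overrides.max_tokens", 16000),
    "turns-96": ("budget_turns", 96),  # hypothetical knob value
}
cells = build_cells(baseline, candidate_knobs, prefix="kimi")
print(sorted(cells))
```

Deep-copying the baseline for every cell is the design point: a shallow copy would let one cell's nested override leak into the others and quietly break the one-knob guarantee.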

Candidate knobs to vary, ranked by typical impact:

| Knob | When to test | Skip when |
|---|---|---|
| `temperature` | non-reasoning chat models | API forces 1.0 in thinking mode (kimi-k2.6, deepseek-reasoner) |
| `max_tokens` | reasoning models | already at provider ceiling |
| `tool_choice` | optional-tool flows | API restricts to `auto`/`none` (kimi thinking mode) |
| `strict` on function defs | JSON-arg loop suspected | provider doesn't support strict |
| `reasoning_effort` | reasoning-capable models | API rejects (grok-4.20-reasoning) |
| `prompt structure` (thin vs rich) | always worth one cell | you have only ONE canary |
| `max_turns` | "model keeps deferring" failures | model already converges in <20 turns |
| `prompt_cache_key` | provider supports prompt caching | no cache support |

What NOT to vary in the matrix — list these as "held constants" in your design doc:

- Settings the API rejects in your chosen mode (kimi thinking mode forces `temperature=1.0`, `top_p=0.95`, `presence_penalty=0`, `n=1`)
- Settings the docs + 3+ partner adapters agree on (e.g. kimi `thinking.keep=all`, `reasoning_content` echo)
- Endpoint / region (region-locked keys; tested elsewhere)
- HTTP-client config (timeout, retry — tested elsewhere in your adapter audit)

## Canary selection

A canary is a complete task with deterministic acceptance criteria. The point is: when you compare cells, you need an apples-to-apples task surface.

Rules:

- **Pick 1 canary that maps to your real workload shape.** If your pool handles REQUIRED-FIX bug-bundle work, your canary should be a multi-file bug bundle, not a clean greenfield feature. The kimi matrix used `canaries/tier-2/agent-runtime-bug-bundle` exactly because that's what bm-3 actually sees.
- **Optionally add 1 second canary at a different complexity tier** to detect workload-dependent winners. The kimi matrix added `canaries/tier-3/197e215e-record-cost-on-send-error` (heavy refactor) and discovered that `thinking-off` wins tier-2 cost by 27% but goes 0/3 on tier-3.
- **Avoid synthetic toy tasks.** "Write a function that reverses a string" doesn't exercise the failure modes that matter: tool-call loops, JSON-arg corruption, multi-file consistency, signature-callsite propagation.
- **Reuse canaries across matrices when possible.** If your previous deepseek matrix used canary X, use the same X for the kimi matrix. Cross-provider comparisons become legitimate.

## Budget envelope math

```
total_runs   = cells × canaries × N
total_cost   = total_runs × $/run
wallclock_hr = total_runs × avg_run_seconds / 3600 / PARALLEL
```

Real numbers from the kimi-k2.6 matrix:

```
cells     = 8
canaries  = 2
N         = 3
total     = 48 runs
$/run     ≈ $0.45 tier-2, $1.61 tier-3 (baseline) → ~$2.06 per matched pair
total_$   = $24 (actual)
wallclock = ~4 hours at PARALLEL=3
```

DeepSeek matrix:

```
cells     = 16
canaries  = 1
N         = 3
total     = 48 runs
$/run     ≈ $0.13 avg
total_$   = $6.26 (actual)
```

Grok matrix (v1):

```
cells     = 15
canaries  = 1
N         = 3
total     = 45 runs (v1 contaminated by rate-limit)
v2 rerun at PARALLEL=1 was clean → 45 × $0.45 avg ≈ $20
```

Plan for the upper end. If your tier-3 canary is expensive, fewer cells beats more N — N=3 still gives 4-step resolution (0/3, 1/3, 2/3, 3/3) on binary outcomes.
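
The envelope formulas translate directly into a small planning helper. This is a sketch: the function name `budget_envelope` is hypothetical, the cell count, N, and $/run come from the DeepSeek matrix above, and the 300-second average run time is an assumed figure used only to exercise the wallclock formula.

```python
def budget_envelope(cells, canaries, n, cost_per_run, avg_run_seconds, parallel):
    """Return (total_runs, total_cost_usd, wallclock_hours) for a planned matrix."""
    total_runs = cells * canaries * n
    total_cost = total_runs * cost_per_run
    wallclock_hr = total_runs * avg_run_seconds / 3600 / parallel
    return total_runs, total_cost, wallclock_hr


runs, cost, hours = budget_envelope(
    cells=16,
    canaries=1,
    n=3,
    cost_per_run=0.13,      # average $/run from the DeepSeek matrix
    avg_run_seconds=300,    # assumed, not measured
    parallel=3,
)
print(f"{runs} runs, ${cost:.2f}, {hours:.1f} h")
```

Run it with your most expensive canary's $/run to get the upper-end figure before committing to a cell count.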

## Aggregation: the trap that quietly invalidates matrices

This is the methodology rake that costs the most time when stepped on. The reference harness script `run-matrix.sh` writes a `results.csv` like:

```csv
cell,attempt,verdict,cost_usd
kimi-baseline,1,pass,1.611
kimi-baseline,2,fail,1.732
kimi-baseline,3,pass,1.490
```

The `verdict` column reads ONLY the first canary's `verdict.json`:

```bash
verdict=$(jq -r '.criteria_results | ...' "$out"/per-task/*/verdict.json | head -1)
```

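The glob matches every canary's `verdict.json`, but `head -1` keeps only the first line of `jq` output. If you have two canaries per run, the `results.csv` summary silently ignores the second one. A cell that's 3/3 on canary-1 and 0/3 on canary-2 looks like a 3/3 winner in the CSV.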

The fix is to aggregate from the per-task verdict files directly:

```bash
# correct aggregation: walk every per-task/<canary>/verdict.json
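for cell_dir in bench-results/"$RUN_ID"/*/; do
    cell=$(basename "$cell_dir")
    for canary_dir in "$cell_dir"run*/per-task/*/; do
        verdict_file="${canary_dir}verdict.json"
        # an unmatched glob or a crashed run leaves no verdict file: warn on stderr
        # instead of handing jq a bad path, so the gap stays visible
        [ -f "$verdict_file" ] || { echo "warn: no verdict.json in $canary_dir" >&2; continue; }
        canary=$(basename "$canary_dir")
        verdict=$(jq -r '.criteria_results | if all(.passed) then "pass" else "fail" end' \
                  "$verdict_file")
        cost=$(jq -r '.cost_usd // 0' "$verdict_file")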
        echo "$cell,$canary,$verdict,$cost"
    done
done | python3 aggregate.py
```

Then compute per-(cell, canary) pass-rate and cost-per-pass, NOT per-cell aggregates. A per-cell pass-rate that averages across canaries hides workload-shape effects.
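
The article does not show `aggregate.py`, so the following is only a sketch of what such a script could compute from `cell,canary,verdict,cost` rows. It assumes "cost-per-pass" means the total spend on a (cell, canary) pair divided by its number of passes, so failed runs are charged to the passes. The sample rows are made up for illustration.

```python
import csv
from collections import defaultdict


def aggregate(lines):
    """lines: iterable of 'cell,canary,verdict,cost' rows.

    Returns {(cell, canary): {"runs", "passes", "pass_rate", "cost_per_pass"}}.
    """
    totals = defaultdict(lambda: {"runs": 0, "passes": 0, "cost": 0.0})
    for row in csv.reader(lines):
        if not row:
            continue  # tolerate blank lines in the stream
        cell, canary, verdict, cost = row
        bucket = totals[(cell, canary)]
        bucket["runs"] += 1
        bucket["passes"] += 1 if verdict == "pass" else 0
        bucket["cost"] += float(cost)

    table = {}
    for key, bucket in sorted(totals.items()):
        passes = bucket["passes"]
        table[key] = {
            "runs": bucket["runs"],
            "passes": passes,
            "pass_rate": passes / bucket["runs"],
            # undefined when nothing passed; None avoids a misleading 0
            "cost_per_pass": bucket["cost"] / passes if passes else None,
        }
    return table


# Made-up rows, for illustration only.
sample = [
    "cell-a,tier-2,pass,0.40",
    "cell-a,tier-2,pass,0.50",
    "cell-a,tier-2,fail,0.60",
    "cell-a,tier-3,fail,1.50",
]
for (cell, canary), stats in aggregate(sample).items():
    print(cell, canary, stats)
```

To use it at the end of the shell pipeline, the script would call `aggregate(sys.stdin)` and print the table. Keying on the (cell, canary) pair rather than on the cell alone is what keeps a tier-3 collapse from being averaged away.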

## A sample `gen-matrix-configs.sh` skeleton

The cell-template pattern: a base YAML config + a function that emits one file per cell with a single field changed.

```bash
#!/usr/bin/env bash
# Generates OFAT matrix configs. Each cell flips ONE knob vs baseline.
set -euo pipefail
OUT="configs/matrix-kimi"
mkdir -p "$OUT"

emit() {
    local cell="$1"
    cat > "$OUT/${cell}.yaml"
    echo "wrote $OUT/${cell}.yaml"
}

# baseline — documented defaults + minimum-viable overrides
emit "kimi-baseline" <<EOF
role: heavy
model: moonshot/kimi-k2.6
budget_turns: 48
budget_wallclock_seconds: 1800
prompt_overrides:
  system_addendum: |
    ## Heavy-tier scope (DO NOT defer on multi-file scope alone)
provider_overrides:
  thinking:
    type: enabled
    keep: all
  max_tokens: 32000
EOF

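# vary ONE knob: thinking-off
# (temperature is set explicitly because it is only forced to 1.0 in thinking mode;
#  record this coupled change in the design doc)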
emit "kimi-thinking-off" <<EOF
role: heavy
model: moonshot/kimi-k2.6
budget_turns: 48
budget_wallclock_seconds: 1800
prompt_overrides:
  system_addendum: |
    ## Heavy-tier scope (DO NOT defer on multi-file scope alone)
provider_overrides:
  thinking:
    type: disabled
  temperature: 0.6
  max_tokens: 32000
EOF

# vary ONE knob: max_tokens at docs floor
emit "kimi-max-tokens-16k" <<EOF
role: heavy
model: moonshot/kimi-k2.6
budget_turns: 48
budget_wallclock_seconds: 1800
prompt_overrides:
  system_addendum: |
    ## Heavy-tier scope (DO NOT defer on multi-file scope alone)
provider_overrides:
  thinking:
    type: enabled
    keep: all
  max_tokens: 16000
EOF

# ... repeat for each candidate knob, ONE change per cell
```

Run with:

```bash
N=3 PARALLEL=3 CONFIG_DIR=configs/matrix-kimi ./scripts/run-matrix.sh
```

## What an OFAT matrix can't tell you

Single-canary signal does not generalize to all workload shapes. A cell that's 3/3 on tier-2 bug bundles may be 0/3 on tier-3 multi-file refactors. The kimi matrix's `kimi-thinking-off` cell is the canonical case: best tier-2 cost in the matrix, complete capability loss on tier-3. Without the second canary you'd ship a regression.

N=3 has wide confidence intervals. A 2/3 vs 3/3 difference is one run. Don't claim "X beats Y" on a one-vote margin — call it noise unless N goes up. The grok-matrix-v1 → v2 comparison shows the volatility: cells that looked like 2/3 winners in v1 went 0/3 in v2 on the same config.
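
To put numbers on "wide", one option is the Wilson score interval for a binomial proportion. The choice of interval and the 95% level (z = 1.96) are assumptions of this sketch, not something the matrices above used.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate of passes/n. Returns (low, high)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # clamp to [0, 1] to absorb floating-point overshoot at the extremes
    return max(0.0, center - half), min(1.0, center + half)


for passes in range(4):
    low, high = wilson_interval(passes, 3)
    print(f"{passes}/3 -> [{low:.2f}, {high:.2f}]")
```

At N=3 the interval for 2/3 overlaps most of the interval for 3/3, which is why a one-run margin should be read as noise.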

Matrix budget caps the number of knobs. Eight cells × two canaries × N=3 = 48 runs. If you have 15 candidate knobs, pick the eight with the strongest priors and skip the rest. Use the 5-agent research pattern (see [five-agent-research-pattern.md](five-agent-research-pattern.md)) to prune before designing the matrix.

OFAT misses interaction effects by design. If `reasoning_effort=high` only helps when paired with `prompt-d4-rich`, OFAT won't find it — each cell varies one knob. Add a small Phase 2b after the OFAT round to test the top 2-3 winners in combination.
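
A Phase 2b cell list can be generated mechanically from the OFAT winners. The sketch below assumes flat config dicts and pairwise combinations. The helper name `combination_cells`, the baseline values, and the three "winner" overrides are hypothetical placeholders, not results.

```python
from itertools import combinations


def combination_cells(baseline, winners, size=2):
    """Build one cell per combination of `size` winning overrides.

    winners maps a cell name to the flat override dict that cell applied.
    """
    cells = {}
    for names in combinations(sorted(winners), size):
        cell = dict(baseline)
        for name in names:
            # if two winners touch the same key, the later name wins;
            # such pairs are not meaningful combinations and should be dropped
            cell.update(winners[name])
        cells["+".join(names)] = cell
    return cells


# Hypothetical baseline and winners, for illustration only.
baseline = {"reasoning_effort": "medium", "prompt": "thin", "strict": False}
winners = {
    "reasoning-high": {"reasoning_effort": "high"},
    "prompt-rich": {"prompt": "rich"},
    "strict-on": {"strict": True},
}
for name, config in combination_cells(baseline, winners).items():
    print(name, config)
```

Three winners give three pairwise cells, so the combination round stays small next to the OFAT round, provided you only carry forward the top 2-3 winners.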

## Common Mistakes

**Testing settings the API forces.** A cell that sets `temperature=0.1` for kimi-k2.6 in thinking mode returns API errors on every run — a wasted cell. Verify the held-constants list against actual API behavior before generating configs. The kimi research synthesis listed 11 forced constants before the matrix design started.

**Testing too many knobs at once.** If a cell changes `temperature`, `max_tokens`, AND `tool_choice`, you can't attribute a verdict to any single knob. Each cell varies exactly one. If you want interaction effects, do a separate Phase 2b after the OFAT winners are known.

**Skipping the baseline cell.** Without an anchor, "this cell got 2/3" has no meaning. The baseline establishes the reference pass-rate at the chosen N. The kimi matrix found `baseline` was the optimal cell — every variant degraded. That's a real result only because the baseline was in the matrix.

**Trusting `results.csv` from harness scripts that only read the first canary's verdict.** This is the trap from "Aggregation" above. The kimi matrix initially looked like 5 cells tied at 3/3 — until per-task aggregation revealed that 4 of those 5 were 0/3 on the tier-3 canary.

**Per-cell aggregation across canaries.** Reporting "kimi-thinking-off averaged 50% pass" hides that it's 100% on one canary and 0% on the other. Always stratify by (cell, canary).

**Drawing strong conclusions from a contaminated run.** The grok v1 matrix was 82% rate-limited; the two "passing" cells happened to be ones with naturally slow API calls that avoided the rate-limit. v2 at PARALLEL=1 contradicted both signals. Verify the failure mode before ranking cells.

**Stopping after the matrix without writing the doc.** The matrix's output is a 2-3 page memo: winner cell, explicit losers and why, what to ship in the production pod config, what to file as follow-up work. Without the memo, the next person tunes by hunch again.

