---
title: "Reasoning-Model Tuning Asymmetry: Why Thin Prompts Beat Rich Prompts (and When They Don't)"
description: "Empirical asymmetry across four OFAT prompt-tuning matrices. Reasoning models penalize rich prompts; non-reasoning chat models depend on them. Routing rule, falsifiable test, and reproduction recipe."
url: https://agent-zone.ai/knowledge/agent-tooling/reasoning-model-tuning-asymmetry/
section: knowledge
date: 2026-05-20
categories: ["agent-tooling"]
tags: ["prompt-engineering","reasoning-models","kimi","deepseek","grok","sonnet","ofat","tuning"]
skills: ["prompt-engineering","model-evaluation","ab-testing"]
tools: ["go","moonshot","deepseek","xai"]
levels: ["intermediate","advanced"]
word_count: 1302
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/reasoning-model-tuning-asymmetry/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/reasoning-model-tuning-asymmetry/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Reasoning-Model+Tuning+Asymmetry%3A+Why+Thin+Prompts+Beat+Rich+Prompts+%28and+When+They+Don%27t%29
---


# Reasoning-Model Tuning Asymmetry

Practitioners assume "better prompt = better output". For one model class, that assumption is correct. For the other, the same prompt makes things measurably worse. This article documents the asymmetry, names the dividing line, and gives you a 4-cell test to confirm it on your own canary before you commit to a prompt.

The asymmetry is empirical, not theoretical. It shows up cleanly across four independent OFAT (one-factor-at-a-time) matrices run between 2026-05-18 and 2026-05-20: sonnet POC, grok matrix v1+v2, deepseek matrix v1, kimi matrix v1.

## TL;DR for agents

- If your model has an internal reasoning channel (Sonnet 4.6, Opus, kimi-k2.6, deepseek-V4-Pro, grok-4.20-reasoning), **start with the thinnest prompt that compiles** and only add scaffolding when a specific failure mode demands it.
- If your model is a non-reasoning chat model (deepseek-V4-Flash, grok-4.3, kimi-thinking-off, gemini-2.5-flash), **start with the d4-rich prompt** — checklist + callsite-exhaustiveness rule + verify-before-push. Without it you get sub-50% pass rates.
- Run a 4-cell matrix `{baseline, d4-rich} × {thin, plus-callsites}` at N≥3 before committing — self-ask answers are unreliable; only canary data settles which side of the line a new model sits on.
- Never copy a prompt from a chat-model deployment to a reasoning-model deployment without re-validating. Cost can jump 12× and pass rates can collapse to 0/3.
- If a reasoning model already wraps `<thinking>` blocks internally, do not add "let's think step by step" — you are paying for redundant reasoning that competes with the model's own.
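The routing rule in the first two bullets can be sketched as a small lookup. This is a minimal illustration, not part of any harness described here: the `ModelClass` type and `StartingPrompt` function are hypothetical names, and the result is only the prompt to try first, since the matrix still decides what ships.

```go
package main

import "fmt"

// ModelClass is a hypothetical label for the two classes discussed above.
type ModelClass int

const (
	NonReasoningChat ModelClass = iota // no internal reasoning channel
	Reasoning                          // has an internal reasoning channel
)

// StartingPrompt returns the prompt profile to try first for a model class.
// It is a starting point only; canary data decides the final prompt.
func StartingPrompt(c ModelClass) string {
	if c == Reasoning {
		return "thin"
	}
	return "d4-rich"
}

func main() {
	// Classifications copied from the bullets above.
	models := []struct {
		name  string
		class ModelClass
	}{
		{"kimi-k2.6", Reasoning},
		{"deepseek-V4-Pro", Reasoning},
		{"deepseek-V4-Flash", NonReasoningChat},
		{"grok-4.3", NonReasoningChat},
	}
	for _, m := range models {
		fmt.Printf("%s: start with %s\n", m.name, StartingPrompt(m.class))
	}
}
```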

## The Data

Same canary (`tier-3/197e215e-record-cost-on-send-error`, a heavy multi-file refactor), same harness, same d4-rich addendum applied as the only varied factor:

| Model | Type | Baseline pass | + d4-rich prompt | Effect |
|---|---|---|---|---|
| deepseek-V4-Flash | non-reasoning chat | 33% (1/3) | 100% (3/3) | **+67 points (3× pass rate): d4-rich rescues it** |
| grok-4.3 | non-reasoning chat | 0% (0/3) | 40% (production-tuned) | d4-rich is the unlock |
| kimi-k2.6 | reasoning | 67% tier-3 (2/3) | 0% tier-3, 12× more $ on tier-2 | **d4-rich is catastrophic** |
| grok-4.20-reasoning | reasoning | n/a (no clean baseline) | 0% (0/2) | fails with d4-rich; no baseline for comparison |
| deepseek-V4-Pro | reasoning | 100% (3/3) @ $0.17 | 67% (2/3) @ $0.27 | d4-rich hurts both pass and cost |
| anthropic/sonnet-4-6 | reasoning | 100% (2/2) @ $7.73 | not re-tested | baseline already wins |

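The split is clean wherever both cells exist: the same d4-rich prompt that lifts both non-reasoning models lowers the pass rate on kimi-k2.6 and deepseek-V4-Pro. grok-4.20-reasoning failed with d4-rich but has no clean baseline to compare against, and sonnet-4-6 was not re-tested with d4-rich, so those two rows are consistent with the pattern rather than independent confirmation of it.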
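To compare rows on one axis, pass rate and cost can be folded into dollars per passing run. The sketch below uses the deepseek-V4-Pro row and assumes the "@ $" figures in the table are per-run averages; the `Cell` type and its methods are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// Cell holds one model/prompt result. CostPerRun is in USD and assumes the
// "@ $" figures in the table are per-run averages.
type Cell struct {
	Passes, Runs int
	CostPerRun   float64
}

// PassRate returns the fraction of runs that passed (0 when there are no runs).
func (c Cell) PassRate() float64 {
	if c.Runs == 0 {
		return 0
	}
	return float64(c.Passes) / float64(c.Runs)
}

// CostPerPass returns dollars spent per passing run.
// ok is false when nothing passed, because the ratio is undefined.
func (c Cell) CostPerPass() (usd float64, ok bool) {
	if c.Passes == 0 {
		return 0, false
	}
	return c.CostPerRun * float64(c.Runs) / float64(c.Passes), true
}

func main() {
	// deepseek-V4-Pro row from the table above.
	cells := []struct {
		name string
		cell Cell
	}{
		{"baseline", Cell{Passes: 3, Runs: 3, CostPerRun: 0.17}},
		{"d4-rich", Cell{Passes: 2, Runs: 3, CostPerRun: 0.27}},
	}
	for _, c := range cells {
		if usd, ok := c.cell.CostPerPass(); ok {
			fmt.Printf("%s: pass rate %.2f, $%.3f per pass\n", c.name, c.cell.PassRate(), usd)
		} else {
			fmt.Printf("%s: no passing runs\n", c.name)
		}
	}
}
```

Cells with zero passes (such as kimi-k2.6 with d4-rich on tier-3) have no finite cost per pass, which is why the function reports that case separately instead of returning a number.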

## The Pattern: What d4-Rich Actually Contains

The d4-rich addendum is a small block — checklist + callsite rule + verify-before-push. Copied verbatim from the kimi matrix config generator:

```
## Completion checklist (kimi-rich addendum)

Before calling push_branch, verify EACH item:

- [ ] All files in the spec's `files:` block exist with non-empty
  meaningful content (not stub/placeholder).
- [ ] Tests added or updated for every modified function.
- [ ] `go build ./...` succeeds (or language equivalent).
- [ ] No new TODO/FIXME comments added.
- [ ] PR description summarizes ONLY changes present in `git diff`;
  do NOT mention intended, planned, or attempted changes that are
  not in the diff.

## Callsites — exhaustively-updated requirement

When changing an exported function signature, search the entire repo
for callsites via `grep_codebase` BEFORE pushing. Every caller
across cmd/, internal/, pkg/ must be updated to the new signature in
the same commit. Partial signature changes fail review.
```

This block is ~200 tokens. On non-reasoning models it adds explicit structure the model would otherwise skip. On reasoning models it adds redundant structure the model already runs internally, and the model now spends reasoning tokens enumerating the checklist instead of doing the work.

## Why the Asymmetry Holds

The hypothesis that best fits the data:

**Reasoning models do their own checklist internally.** When you crack open a Sonnet `<thinking>` block, it already contains "let me check the spec files, let me verify the test exists, let me re-read the callers". Adding an external checklist creates two effects: the model spends reasoning tokens reciting and answering it (on kimi-k2.6 tier-2 this was 12× output cost — $0.57 vs $0.047 — with no pass improvement); and external scaffolding can override the model's own better instincts.

**Non-reasoning chat models lack that internal step.** They produce tokens directly without an explicit planning channel. Without external structure they push without testing, change a signature without updating callers, write a PR body that doesn't match the diff. The d4-rich block forces them through the structure the reasoning models would have generated for themselves.

When asked "do you prefer rich or thin prompts?", kimi-k2.6 self-reported "rich/examples-heavy preferred". The matrix proved the opposite. Self-ask is not a reliable router (see `self-ask-trap-llm-introspection.md`); canary data is.

## Trade-Offs

| Combination | When it wins |
|---|---|
| **Non-reasoning + d4-rich** | Best $/pass when the spec is concrete and the model can absorb structure. deepseek-V4-Flash + d4-rich on a 100-dispatch heavy workload: $4 total vs $773 for sonnet. |
| **Reasoning + thin** | Best $/pass when the spec is fuzzy or requires novel design. Sonnet baseline absorbs ambiguity that flash drops; pay the premium where pass rate matters. |
| **Non-reasoning + thin** | Underperforming default. Most prompt deployments hit this combination by inheriting prompts from earlier reasoning-model setups. Pass rate cliff. |
| **Reasoning + d4-rich** | Active anti-pattern. Pay more, get worse outcomes, lose internal-planning quality. |

The dominant cost in autonomous fleets is the wrong combination silently shipping for weeks. A pool config with the wrong prompt is invisible until you re-canary.
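The cost of a wrong combination scales linearly with dispatch count, which a few lines of arithmetic make concrete. This sketch assumes every dispatch costs the per-run average; `FleetCost` and `ExcessSpend` are hypothetical helper names, and the inputs are the per-run figures quoted elsewhere in this article.

```go
package main

import "fmt"

// FleetCost projects total spend in USD, assuming every dispatch costs the
// same per-run average.
func FleetCost(costPerRun float64, dispatches int) float64 {
	return costPerRun * float64(dispatches)
}

// ExcessSpend is the extra spend from running the wrong prompt instead of
// the right one over the same number of dispatches.
func ExcessSpend(wrongPerRun, rightPerRun float64, dispatches int) float64 {
	return FleetCost(wrongPerRun, dispatches) - FleetCost(rightPerRun, dispatches)
}

func main() {
	// $7.73 per run is the sonnet baseline from the data table.
	fmt.Printf("sonnet, 100 dispatches: $%.2f\n", FleetCost(7.73, 100))

	// kimi-k2.6 tier-2: $0.57 per run with d4-rich vs $0.047 without.
	fmt.Printf("kimi-k2.6 tier-2 excess over 100 dispatches: $%.2f\n",
		ExcessSpend(0.57, 0.047, 100))
}
```

The projection covers spend only; it does not price the failed runs themselves, which is usually the larger loss.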

## How to Test on a New Model

Don't trust self-report. Don't trust analogy from a sibling model. Run a 4-cell mini-matrix:

```
Cells:
  baseline-thin           — current/default prompt, no addendum
  baseline-plus-callsites — baseline + just the callsites rule
  d4-rich                 — full d4-rich addendum
  d4-rich-plus-callsites  — d4-rich with explicit callsites repeated

Per cell: N=3 minimum on a representative canary
Track:    pass_rate, $/run, output_tokens_median, defer_rate
```

The four cells let you distinguish three cases:
- Model already does callsites internally → baseline-plus-callsites ≤ baseline-thin
- Model needs callsites explicitly → baseline-plus-callsites > baseline-thin
- Model needs the full structure → d4-rich > both
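The three comparisons above can be sketched as a decision function. The `CellResult`, `Matrix`, and `Interpret` names are hypothetical, as is the refusal to interpret cells with fewer than three runs; the d4-rich-plus-callsites cell is carried in the struct but the three rules do not reference it.

```go
package main

import "fmt"

// CellResult is a hypothetical record for one matrix cell.
type CellResult struct {
	Passes, Runs int
}

// better reports whether a has a strictly higher pass rate than b.
// Cross-multiplying keeps the comparison exact for integer counts.
func better(a, b CellResult) bool {
	return a.Passes*b.Runs > b.Passes*a.Runs
}

// Matrix holds the four cells from the recipe above.
type Matrix struct {
	BaselineThin, BaselinePlusCallsites, D4Rich, D4RichPlusCallsites CellResult
}

const minRuns = 3 // N>=3 per cell, as recommended above

// Interpret maps a finished matrix onto the three cases listed above.
func Interpret(m Matrix) string {
	cells := []CellResult{m.BaselineThin, m.BaselinePlusCallsites, m.D4Rich, m.D4RichPlusCallsites}
	for _, c := range cells {
		if c.Runs < minRuns {
			return "insufficient data"
		}
	}
	switch {
	case better(m.D4Rich, m.BaselineThin) && better(m.D4Rich, m.BaselinePlusCallsites):
		return "needs the full structure"
	case better(m.BaselinePlusCallsites, m.BaselineThin):
		return "needs callsites explicitly"
	default:
		return "already does callsites internally"
	}
}

func main() {
	// baseline-thin and d4-rich match the deepseek-V4-Flash row in the data
	// table; the two plus-callsites cells are made-up values for illustration.
	m := Matrix{
		BaselineThin:          CellResult{Passes: 1, Runs: 3},
		BaselinePlusCallsites: CellResult{Passes: 1, Runs: 3},
		D4Rich:                CellResult{Passes: 3, Runs: 3},
		D4RichPlusCallsites:   CellResult{Passes: 3, Runs: 3},
	}
	fmt.Println(Interpret(m))
}
```

At N=3 a one-run difference flips the verdict, so treat a narrow result as a reason to add runs rather than as a decision.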

Total cost on a heavy canary at N=3 across 4 cells is ~$5–$25 depending on the model. Cheap insurance against shipping the wrong prompt to a 100-dispatch workload.

## When the Asymmetry Might Not Apply

Three observed-but-unconfirmed exceptions:

1. **Single-file trivial fixes**. Both classes pass; prompt choice is noise. Don't burn matrix budget here.
2. **Open-ended research tasks**. Reasoning models with thin prompts may drift; mild scaffolding ("first read these files, then summarize") can help — not the d4-rich addendum.
3. **Models fine-tuned on checklist-following data**. We have no example yet, but this could re-invert the asymmetry. The rule is empirical, not architectural. Re-canary on every new model.

## Common Mistakes

**Copying a chat-model prompt to a reasoning-model deployment.** Team builds `CLAUDE-builder-heavy-fast.md` for deepseek-V4-Flash, ships it, six weeks later someone reuses the same prompt for kimi-k2.6 "because both are heavy-tier". Pass rate collapses; the team blames the model.

**Adding "let's think step by step" to a reasoning model.** The model already thinks step by step. Your instruction now competes with its internal plan and burns reasoning tokens on a meta-loop.

**Trusting self-reports about prompt preferences.** kimi-k2.6 self-reported "rich/examples-heavy preferred". The matrix proved the opposite. Verify with a canary.

**Treating "more tokens in prompt" as obviously good.** For reasoning models, longer prompts cost in two layers: input tokens billed once, plus reasoning tokens spent re-processing the structure every turn. The 12× cost blow-up on kimi-prompt-rich tier-2 was almost entirely reasoning-channel inflation.

**Skipping the matrix because "it's just a prompt".** A $25 4-cell matrix amortizes against the first week of production dispatches.

## Putting It All Together

Before assigning a model to a builder pool:

1. Classify: reasoning channel or not? (Check provider docs — most reasoning models name it: thinking, reasoning_content, reasoning_effort, `<thinking>`.)
2. Pick the matching prompt baseline: thin for reasoning, d4-rich for chat.
3. Run a 4-cell matrix on your real canary at N=3.
4. Ship the winning cell to the pool config.
5. Re-canary on model upgrade.
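Step 5 is the easiest to forget, so it helps to record which model the winning cell was validated against. The sketch below assumes a hypothetical `PoolAssignment` record; its field names and the `kimi-next` identifier are placeholders, not part of any real pool config format.

```go
package main

import "fmt"

// PoolAssignment is a hypothetical pool-config record.
type PoolAssignment struct {
	Model         string // model currently served by the pool
	PromptCell    string // winning matrix cell, e.g. "baseline-thin"
	CanariedModel string // model the matrix was actually run against
}

// NeedsRecanary reports whether the shipped prompt was never validated, or
// was validated against a different model than the one now deployed.
func (p PoolAssignment) NeedsRecanary() bool {
	return p.CanariedModel == "" || p.CanariedModel != p.Model
}

func main() {
	pool := PoolAssignment{
		Model:         "kimi-k2.6",
		PromptCell:    "baseline-thin",
		CanariedModel: "kimi-k2.6",
	}
	fmt.Println("re-canary needed:", pool.NeedsRecanary())

	// "kimi-next" is a placeholder for whatever upgrade replaces the model.
	pool.Model = "kimi-next"
	fmt.Println("re-canary needed after upgrade:", pool.NeedsRecanary())
}
```

A check like this can gate deploys, so a model swap cannot silently inherit a prompt tuned for its predecessor.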

The asymmetry has held for kimi-k2.6, deepseek-V4-Pro/Flash, and grok-4.3. The sonnet and grok-4.20-reasoning results are consistent with it, though each is missing one side of the comparison. Stop budgeting time on prompt-tuning reasoning models; budget it on prompt-tuning the chat models that actually move on the data.

