---
title: "The d4-rich Prompt Pattern: Unlocking Non-Reasoning Models on Multi-File Tasks"
description: "A three-part prompt addendum (completion checklist + callsites-exhaustively-updated rule + verify-before-push) reliably unlocks pass-rate on non-reasoning chat models. Same pattern catastrophically hurts reasoning models. Backed by three matrices and a concrete cost+quality table."
url: https://agent-zone.ai/knowledge/agent-tooling/prompt-rich-pattern-non-reasoning-models/
section: knowledge
date: 2026-05-20
categories: ["agent-tooling"]
tags: ["prompt-engineering","deepseek","grok","kimi","matrix-testing","non-reasoning-models","tool-use"]
skills: ["prompt-design","model-selection","scaffolding-pattern-design"]
tools: ["deepseek","grok","kimi","claude"]
levels: ["intermediate","advanced"]
word_count: 1423
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/prompt-rich-pattern-non-reasoning-models/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/prompt-rich-pattern-non-reasoning-models/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=The+d4-rich+Prompt+Pattern%3A+Unlocking+Non-Reasoning+Models+on+Multi-File+Tasks
---


# The d4-rich Prompt Pattern

Non-reasoning chat models (deepseek-V4-Flash, grok-4.3, kimi with thinking disabled) collapse on multi-file refactor tasks when given thin or baseline prompts: they pass 0-33% of canaries that reasoning models clear at 67-100%. The cheap fix is a three-part prompt addendum: a heavy-tier scope directive, a completion checklist to verify before pushing, and a callsites-exhaustively-updated rule. Drop it into the system prompt of a non-reasoning model and pass rates jump (33% → 100% on deepseek-V4-Flash, 0% → 40% on grok-4.3). Drop it into a reasoning model and you pay up to 12× more for a lower pass rate.

This is the d4-rich pattern. It's load-bearing for the heavy-fast tier; it's actively harmful for the reasoning tier. Knowing which to apply is the whole job.

## TL;DR — for prompt engineers

- **Three additions**: heavy-tier scope directive (don't defer on multi-file alone) + completion checklist (verify each item before push_branch) + callsites-exhaustively-updated rule (grep before pushing signature changes)
- **Use on**: non-reasoning chat models doing multi-file work. deepseek-V4-Flash, grok-4.3, kimi with thinking disabled.
- **Do NOT use on**: reasoning models. kimi-k2.6 with thinking on, deepseek-V4-Pro, grok-4-reasoning. They already do this internally; the scaffolding is redundant and catastrophically expensive.
- **Effect sizes** (real matrix data): deepseek-V4-Flash 33% → 100%; grok-4.3 0% → 40%; kimi-k2.6 67% → 0% (with 12× cost increase); deepseek-V4-Pro 100% → 67% (with 60% cost increase)
- **Why it works on chat models**: external scaffolding substitutes for the model's missing internal "verify before commit" step. Chat models follow checklists literally.
- **Why it fails on reasoning models**: the scaffolding fights the model's own reasoning trace, consumes context, and confuses the tool-call schedule

## Problem

Non-reasoning chat models perform well on focused single-turn tasks but fail predictably on multi-file work:

- **deepseek-V4-Flash baseline**: 33% pass on heavy canary (refactor 5+ files, run tests, open PR). Common failure: ships a PR with one file modified, ignores the other four, claims completion.
- **grok-4.3 baseline**: ~0% pass. Defers with tier-2 scope-bleed wording ("complex multi-repo task, escalating"). The defer language was inherited from tier-2 templates and misapplied to heavy specs that ARE meant to be multi-file.
- **kimi with thinking disabled**: same pattern — defers or partial-completes.

These models can complete the task. They don't know they're supposed to verify completion before pushing, and don't know "5+ files" is the assignment rather than a warning.

## Pattern (d4-rich)

Three additions to the model's base prompt. Drop them in as separate sections; they compose well.

```markdown
## Heavy-tier scope (DO NOT defer on this alone)

Multi-file changes ARE the heavy-tier mandate. A spec listing 5+ files
across multiple modules is your assignment, not a warning. Do not
defer with phrases like "complex multi-repo" or "exceeds single-cycle
scope" — those are tier-2 defer patterns and don't apply here.

DO defer on:
- Named blockers: missing file, ambiguous spec line, compile error you
  can't resolve after 2-3 attempts
- Acceptance criteria that genuinely don't fit the runtime

DO NOT defer on:
- "Multi-file scope" alone
- "Risk of incomplete status" — incomplete IS still useful; ship it
  and let the reviewer iterate
- "Complex spec" — every heavy-tier spec is complex by design

## Completion checklist (kimi-rich addendum)

Before calling push_branch, verify EACH item:

- [ ] All files in the spec's `files:` block exist with non-empty
  meaningful content (not stub/placeholder)
- [ ] Tests added or updated for every modified function
- [ ] `go build ./...` succeeds (or language equivalent)
- [ ] No new TODO/FIXME comments added
- [ ] PR description summarizes ONLY changes present in `git diff`;
  do NOT mention intended, planned, or attempted changes that are
  not in the diff

## Callsites — exhaustively-updated requirement

When changing an exported function signature, search the entire repo
for callsites via `grep_codebase` BEFORE pushing. Every caller across
cmd/, internal/, pkg/ must be updated to the new signature in the
same commit. Partial signature changes fail review.
```

Each block targets a chat-model failure mode:

1. **The scope directive** stops tier-2-prompt-bleed defers. grok-4.3 otherwise reads "5 files across 3 modules" and applies the tier-2 instruction "if it looks complex, escalate". The directive invalidates that for heavy work.

2. **The completion checklist** is load-bearing. Chat models lack an internal "did I actually do what I said?" step. They write code, write a PR body describing intended changes, push, call it done — even when the file is empty. The checklist forces an explicit verification phase.

3. **The callsites rule** stops the most common heavy-tier review failure: changing a function signature in `internal/foo/foo.go` and not updating the four callers in `cmd/`, `internal/bar/`, `pkg/baz/`. Reasoning models grep naturally; chat models don't.
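
The callsites rule can also be checked mechanically. The following is a minimal Python sketch, not the harness's `grep_codebase` tool (whose implementation the article does not show): `find_callsites` is a hypothetical helper that does a plain textual scan of `.go` files under `cmd/`, `internal/`, and `pkg/` and skips the function's own declaration. A textual match can over- or under-report (comments, same-named methods), so treat the output as a review list rather than proof.

```python
import re
from pathlib import Path


def find_callsites(repo_root, func_name, dirs=("cmd", "internal", "pkg")):
    """Return (relative_path, line_number, stripped_line) for textual calls of func_name."""
    name = re.escape(func_name)
    # Matches `Name(` and `pkg.Name(`.
    call = re.compile(r"\b" + name + r"\s*\(")
    # Matches the declaration itself: `func Name(` or `func (r *T) Name(`.
    decl = re.compile(r"^\s*func\s+(\([^)]*\)\s*)?" + name + r"\s*\(")
    root = Path(repo_root)
    hits = []
    for d in dirs:
        base = root / d
        if not base.is_dir():
            continue  # not every repo has all three directories
        for path in sorted(base.rglob("*.go")):
            lines = path.read_text(encoding="utf-8").splitlines()
            for number, line in enumerate(lines, start=1):
                if call.search(line) and not decl.match(line):
                    rel = path.relative_to(root).as_posix()
                    hits.append((rel, number, line.strip()))
    return hits
```

Running `find_callsites(".", "Parse")` after changing the signature of a function named `Parse` would list each file and line that still needs updating in the same commit.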

## Why it works on chat models, fails on reasoning models

Non-reasoning models emit their answer and tool calls directly, with no separate thinking phase before they commit. Nothing triggers a "wait, did the previous tool call actually do what I claimed?" check unless the prompt forces one. External scaffolding substitutes for that missing internal step, and chat models follow it literally.

Reasoning models already do completion checking internally. Adding d4-rich:

- **Duplicates the check**. Model runs its internal check, then re-runs the external checklist on the same items. Cost goes up; output doesn't change.
- **Fights the reasoning trace**. The model may have concluded "files A and B done, pushing"; the external checklist forces re-derivation, sometimes reaching a different (more cautious) conclusion. Quality drops.
- **Consumes context**. The 500-token addendum displaces 500 tokens of working memory. On tight budgets (kimi-k2.6 fills 100K+ tokens of thinking on heavy tasks), this matters.

The kimi-k2.6 matrix is the clearest case: baseline 67% pass at ~$0.50/run; d4-rich 0% pass at ~$6/run. Same model, same canary — 12× cost and worse output.
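
The arithmetic behind that comparison, as a small sketch. `cost_per_pass` and `describe` are hypothetical helpers; the inputs are the approximate kimi-k2.6 figures quoted above.

```python
import math


def cost_per_pass(cost_per_run: float, pass_rate: float) -> float:
    """Expected spend to obtain one passing run; infinite if nothing passes."""
    if pass_rate <= 0:
        return math.inf
    return cost_per_run / pass_rate


def describe(label: str, value: float) -> str:
    if math.isinf(value):
        return f"{label}: no passing runs, so cost per pass is unbounded"
    return f"{label}: ~${value:.2f} per pass"


# Approximate kimi-k2.6 figures from the text: (cost per run in USD, pass rate).
print(describe("baseline", cost_per_pass(0.50, 0.67)))  # baseline: ~$0.75 per pass
print(describe("d4-rich", cost_per_pass(6.00, 0.00)))
```

On these numbers the baseline costs roughly $0.75 per passing run, while the d4-rich configuration never produces one, so the 12× per-run figure understates the damage.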

## Comparison (real matrix data)

| Model             | Type     | Baseline pass rate | + d4-rich pass rate | Effect       |
|-------------------|----------|--------------------|---------------------|--------------|
| deepseek-V4-Flash | chat     | 33%                | 100%                | unlock       |
| grok-4.3          | chat     | 0%                 | 40%                 | unlock       |
| kimi-k2.6         | reasoner | 67%                | 0% (12× cost)       | catastrophic |
| deepseek-V4-Pro   | reasoner | 100%               | 67% (+60% cost)     | hurts        |

Pattern is clean: chat models benefit, reasoning models are hurt. No "neutral" case in the data.

## When to use, when not to

**Use** when building a non-reasoning-model builder pool — e.g. `builder-heavy-fast` running deepseek-V4-Flash, or `builder-medium-*` arms running grok-4.3 / kimi-with-thinking-off. Bake d4-rich into the pool's system prompt (`agents-builder/CLAUDE-builder-heavy-fast.md` is the production example).

**Use** when iterating on a chat-model agent failing multi-file canaries. Start with d4-rich applied; if pass rate is still low, the next move is spec quality (concrete files, binary acceptance) — not more prompt scaffolding.

**Don't use** with a reasoning model. If unsure whether your model is reasoning or chat, run the canary both ways for N=5 and look at the cost+pass-rate table. The signal is unambiguous.

**Don't combine** with "be concise". The two instructions contradict; chat models follow both literally and produce contradictory behaviour. Pick one.

**Don't use** on tier-2 pools (single-file specs). The scope directive invalidates tier-2 defer patterns — applying it to tier-2 removes a real safety check.
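
For the run-it-both-ways check suggested above, here is a sketch of how the two cells could be compared. Everything in it is an assumption for illustration: the `Trial` record, the `summarize` and `prefer_d4_rich` helpers, and the decision rule (keep d4-rich only when it lowers cost per pass). The trial values in the usage lines are invented to exercise the code; they are not matrix results.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trial:
    passed: bool
    cost_usd: float


def summarize(trials: List[Trial]) -> Tuple[float, float]:
    """Return (pass_rate, cost_per_pass) for one config cell."""
    if not trials:
        raise ValueError("need at least one trial")
    passes = sum(1 for t in trials if t.passed)
    total_cost = sum(t.cost_usd for t in trials)
    cost_per_pass = total_cost / passes if passes else float("inf")
    return passes / len(trials), cost_per_pass


def prefer_d4_rich(baseline: List[Trial], rich: List[Trial]) -> bool:
    """Assumed rule: adopt d4-rich only if it strictly lowers cost per pass."""
    _, base_cpp = summarize(baseline)
    _, rich_cpp = summarize(rich)
    return rich_cpp < base_cpp


# Invented N=5 cells, for illustration only.
chat_like_base = [Trial(True, 0.10)] + [Trial(False, 0.10)] * 4
chat_like_rich = [Trial(True, 0.15)] * 5
print(prefer_d4_rich(chat_like_base, chat_like_rich))  # True
```

If neither cell produces a pass, both costs per pass are infinite and the rule keeps the baseline, which matches the advice to look at spec quality next rather than adding scaffolding.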

## Tie checklist items to actual tool calls

The checklist as written is text-only — the model can claim it ran the check without actually running anything. The strongest version adds a `verify_before_push` tool the runtime exposes:

```
verify_before_push(repo, files=[...], requires_tests=true) → {
  files_present, tests_present, build_passes, no_new_todos
}
```

The model must call this before `push_branch`; the runtime refuses to dispatch `push_branch` if `verify_before_push` hasn't fired in the same turn. Without it, the checklist is honor-system — which works on chat models, but ~10-20% worse than tool-enforced checks.
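
A minimal sketch of that runtime gate, assuming a hypothetical `PushGate` class that the harness consults before dispatching tools. The field names come from the result shape above; the per-turn reset and the requirement that every field be true (rather than the tool merely having fired) are assumptions of this sketch.

```python
REQUIRED_CHECKS = ("files_present", "tests_present", "build_passes", "no_new_todos")


class PushGate:
    """Allows push_branch only after a passing verify_before_push in the same turn."""

    def __init__(self) -> None:
        self._verified = False

    def start_turn(self) -> None:
        # A verification from an earlier turn does not carry over.
        self._verified = False

    def record_verify(self, result: dict) -> None:
        # Missing or false fields count as a failed verification.
        self._verified = all(result.get(name) is True for name in REQUIRED_CHECKS)

    def dispatch(self, tool_name: str) -> str:
        if tool_name == "push_branch" and not self._verified:
            return "refused: call verify_before_push first"
        return f"dispatched: {tool_name}"


gate = PushGate()
gate.start_turn()
print(gate.dispatch("push_branch"))  # refused: call verify_before_push first
gate.record_verify({name: True for name in REQUIRED_CHECKS})
print(gate.dispatch("push_branch"))  # dispatched: push_branch
```

Keeping the state in the runtime rather than in the prompt means the model cannot satisfy the check by asserting that it ran.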

## Common mistakes

**Copying d4-rich into a reasoning model's prompt.** Most expensive mistake. kimi-k2.6 with d4-rich is the canonical case: same task, 12× cost, 0% pass rate. Always run an N=3-5 comparison against baseline before deploying.

**Mixing "be concise" with d4-rich.** Contradictory; chat models follow both literally and defer with "I was told to be concise so I'll skip the checklist". Pick one instruction.

**Not tying checklist items to tool calls.** Honor-system checklists work but ~10-20% worse than enforced ones. Add a `verify_before_push` tool when stakes are high.

**Applying d4-rich uniformly across a heterogeneous pool.** A/B/C/D pools (e.g. `builder-medium-{0..3}`) need per-replica prompts: d4-rich on the grok arm, baseline on a kimi arm running with thinking on. Use per-replica config overrides.

**Treating pass-rate jump as the whole story.** d4-rich also changes cost. Chat models: ~30-50% cost increase for a 40-67pp pass-rate increase (per the comparison table). Reasoning models: a cost increase of 60% up to roughly 1,100% (1.6× to 12× per run) for a lower pass rate. Optimize cost-per-pass, not pass rate alone.
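
The per-replica override mentioned above could look like the following sketch. The replica-to-model assignments, the `kind` and `multi_file` fields, and the `prompt_variant` helper are all hypothetical; the article names the pool pattern `builder-medium-{0..3}` but not its configuration format.

```python
from typing import Dict

# Hypothetical pool layout; real replica assignments are not given in the article.
REPLICAS: Dict[str, dict] = {
    "builder-medium-0": {"model": "grok-4.3", "kind": "chat", "multi_file": True},
    "builder-medium-1": {"model": "kimi-k2.6", "kind": "reasoner", "multi_file": True},
}


def prompt_variant(kind: str, multi_file: bool) -> str:
    """d4-rich only for non-reasoning models doing multi-file work."""
    if kind == "chat" and multi_file:
        return "d4-rich"
    return "baseline"


overrides = {
    name: prompt_variant(cfg["kind"], cfg["multi_file"])
    for name, cfg in REPLICAS.items()
}
print(overrides)
# {'builder-medium-0': 'd4-rich', 'builder-medium-1': 'baseline'}
```

Deriving the variant from model kind and task shape, rather than hard-coding it per replica, keeps the choice correct when a replica's model is swapped.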

## References

- The production text comes from the Dream Team agent-harness matrix generator scripts (`gen-kimi-matrix-configs.sh`, `gen-deepseek-matrix-configs.sh`), which produce config cells that flip exactly one variable relative to baseline
- Production prompt embedding: `agents-builder/CLAUDE-builder-heavy-fast.md` — the heavy-fast pool's system prompt, where the directive + checklist are baked in
- Matrix infrastructure: `~/projects/agent-harness/` — runs N=3 trials per config cell across three tier-3 canary shapes; results aggregate to the pass-rate numbers in the comparison table above
- Companion patterns: `cost-per-pass-not-cost-per-call.md` (the metric to optimize), `reasoning-model-tuning-asymmetry.md` (more on the chat-vs-reasoning prompt split)

