---
title: "Authoring Research Knowledge for Agents (Trust-but-Verify Format)"
description: "A format for writing research/benchmarking knowledge that actually changes a downstream agent's behavior: decision-first TL;DR, claim cards with verify recipes, scope & freshness stamps, prominent negative results, myth-busters, and capability matrices."
url: https://agent-zone.ai/knowledge/agent-tooling/authoring-research-knowledge-for-agents/
section: knowledge
date: 2026-05-25
categories: ["agent-tooling"]
tags: ["documentation","knowledge-authoring","trust-but-verify","research","agent-knowledge","decision-records","negative-results"]
skills: ["knowledge-authoring","research-documentation","decision-record-writing"]
tools: []
levels: ["intermediate"]
word_count: 733
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/authoring-research-knowledge-for-agents/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/authoring-research-knowledge-for-agents/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Authoring+Research+Knowledge+for+Agents+%28Trust-but-Verify+Format%29
---


> **Decision-first:** Write research docs so a downstream agent can *act* in 30 seconds and *verify* cheaply. Lead with the recommendation + its biggest caveat; attach a one-line verification recipe to every load-bearing claim; put what *didn't* work where it can't be missed. Descriptive "how X works" prose is the least valuable part.

> **Scope & freshness:** Format conventions, version-independent. Authored 2026-05-25 from a local-LLM benchmarking effort; the examples are LLM/GPU but the format applies to any research/benchmarking knowledge.

## Why most research docs fail the next agent

The failure mode is rarely *missing information* — it's information the reader can't act on or trust. Three concrete ways an agent loses time even when a doc exists:

1. **It acted on a wrong prior** because no doc flagged the common belief as contested (e.g. "temperature 0.3 is best for coding" — false; it's model-dependent).
2. **It re-ran an already-failed experiment** because the negative result was buried or unwritten (e.g. bumping a turn budget that was never the constraint).
3. **It didn't surface the relevant finding at decision time** because the doc was descriptive, not decision-shaped.

The format below targets all three.

## The claim card

Every load-bearing claim gets four parts. Inline, lightweight:

> **Claim:** `echo_reasoning` helps GLM-4.5-Air but hurts qwen3.6-35b-a3b.
> **Confidence:** high (N≥3, A/B on each).
> **Evidence:** qwen heavy 8/9 (echo off) → 6/9 (echo on); GLM degrades multi-turn without it.
> **Verify:** run the model both ways on a multi-file task, N=3; expect the on/off delta above.

The **Verify** line is the heart of trust-but-verify. A claim a reader can re-confirm cheaply in their own environment is one they'll act on; a claim they can't verify, they'll (correctly) re-derive — wasting the time the doc was meant to save. If you can't write a verify recipe, mark the claim **low confidence** and say so.

## Scope & freshness stamp

Put a blockquote near the top of every research doc:

> **Scope & freshness:** Holds for `<hardware / model class / software versions>`, as of `<date>`. Re-check when `<trigger>` (e.g. a driver/model bump).

Findings about hardware and models **rot fast**. An undated, unscoped claim is a trap — the temperature finding was *correct for one model* and wrong as a generalization; a scope tag would have prevented over-applying it. Always state where a finding **breaks**, not just where it holds.

## The "What didn't work" section (do not omit)

The highest-leverage content for a research agent is the **negative result** — and it's almost never written down. Add a prominent section:

```markdown
## What didn't work (so you don't repeat it)
- Bumping turns 80→100 to fix `budget_exceeded` — it was the *token* budget, not turns; made results worse.
- Lowering the Docker VM cap to free RAM — the VM reclaims dynamically; capping OOM-killed core services.
- Enabling DCGM profiling on a GB10 — PerfWorks won't initialize; not a config issue.
```

Each line: the thing tried, and the one-clause reason it failed. This single section would have saved multiple wasted experiment cycles.

## Decision-first TL;DR

Lead with a blockquote that gives the actionable answer + the single biggest caveat in two lines, *before* any evidence. A skimming agent should get "do X, except when Y" immediately, then drill into evidence only if it needs to verify.

## Two high-value document *types*

Beyond the per-topic guide, two formats punch above their weight:

- **Myth-buster** — "common belief → why it's wrong → evidence." One line flips a default assumption before the reader acts on it.
- **Capability matrix** — a lookup table of "works / doesn't / unsupported" per platform or model. A single table cell (e.g. *GB10 → DCGM profiling: no*) can end an investigation that would otherwise span multiple sessions.

Also valuable: **decision records that keep the rejected options** (the path *not* taken, and why, saves the reader from re-walking it), and **failure-mode catalogs indexed by symptom** (so "`budget_exceeded` at turn 75" leads straight to "that's the token budget").

## Reproduction + cost annotations

Trust-but-verify is only cheap if reproduction is cheap. Include exact commands/configs to re-run a finding, and annotate experiments with **cost/time** ("≈1h on the GPU box, $0") so the reader can budget before committing.

## Authoring checklist

- [ ] Decision-first TL;DR (recommendation + biggest caveat) at the top.
- [ ] Scope & freshness stamp (where it holds, where it breaks, re-check trigger).
- [ ] Every load-bearing claim has confidence + evidence + a **verify recipe**.
- [ ] A prominent **"What didn't work"** section.
- [ ] Myth-busters and capability matrices where they apply.
- [ ] Reproduction commands + cost/time annotations.

