---
title: "Heterogeneous A/B/C/D Pool Dispatch: Real Model Comparison Without an Eval Harness"
description: "Routing the same backlog item to N identical agent replicas running different LLM models, with first-to-ship winning via a shared accepted_by lock. Produces real cost-per-PR and quality-per-PR data on real work, replaces fragile eval harnesses with a continuous bake-off, and surfaces model-x-spec-shape pairings that static benchmarks miss."
url: https://agent-zone.ai/knowledge/agent-tooling/heterogeneous-pool-dispatch/
section: knowledge
date: 2026-05-18
categories: ["agent-tooling"]
tags: ["agent-pools","model-comparison","ab-testing","matched-spec-dispatch","cost-quality-tradeoff","fleet-architecture","llm-evaluation"]
skills: ["heterogeneous-pool-design","matched-spec-dispatch","cost-per-pr-analysis"]
tools: ["mcp","kubernetes"]
levels: ["advanced"]
word_count: 1820
formats:
  json: https://agent-zone.ai/knowledge/agent-tooling/heterogeneous-pool-dispatch/index.json
  html: https://agent-zone.ai/knowledge/agent-tooling/heterogeneous-pool-dispatch/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Heterogeneous+A%2FB%2FC%2FD+Pool+Dispatch%3A+Real+Model+Comparison+Without+an+Eval+Harness
---


You need to know whether `model-X` is worth deploying for your real workload. The benchmarks suggest yes, but benchmarks are static and your workload is not. The standard answer — build an eval harness — runs into two structural problems: harnesses are expensive to build well, and they tend to over-fit to the inputs you remembered to include in the corpus, missing the real production failure modes you discover only later.

Heterogeneous pool dispatch is a different shape. Instead of comparing models in a sandbox, you put N replicas of the same agent role into production, each running a different `(provider, model, prompt)` config, and route the same backlog item to all of them simultaneously. The first replica to ship a working PR closes the item for everyone. The losers' work is discarded. You get real cost-per-PR and quality-per-PR data on real work, and the pattern naturally surfaces which models pair well with which spec shapes, something benchmarks rarely measure.

## The pattern in one diagram

```
        Backlog item (assigned_to: [replica-0, replica-1, replica-2, replica-3])
                                 │
              ┌──────────────┬───┴───┬──────────────┐
              ▼              ▼       ▼              ▼
      ┌─────────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐
      │ replica-0   │  │replica-1 │  │replica-2 │  │replica-3 │
      │ model: A    │  │model: B  │  │model: C  │  │model: D  │
      │ prompt: A'  │  │prompt: B'│  │prompt: C'│  │prompt: D'│
      └─────────────┘  └──────────┘  └──────────┘  └──────────┘
              │              │             │             │
              ▼              ▼             ▼             ▼
        runs spec       runs spec    runs spec     runs spec
        opens PR-A    defers       opens PR-C   still running
              │
              ▼
        first open_pr wins → set accepted_by=replica-0
        other replicas' next dispatch loop sees item already
        accepted → does not re-dispatch
```

Each replica is identical infrastructure (same pod template, same tool surface, same workspace mount); only the model + prompt config differ. The dispatcher hands the same `item_id` to multiple replicas. The first to call `open_pr` (or whatever your "I shipped it" tool is) flips a shared field. On their next cycle, the other replicas read that field, see they are no longer the canonical owner, and stand down.

You end up with real production data: which model shipped first, which deferred, which crashed, how much each cost. After a few weeks the rate card writes itself.

## What you actually learn from this

The output of a heterogeneous pool is a continuous comparison table updated by every real backlog item:

```
Last 7d, backlog dispatched to builder-medium pool (replicas 0..3):

replica   model           items   shipped   deferred   defers/dispatch   $/PR
─────────────────────────────────────────────────────────────────────────────
medium-0  xai/grok-fast    47       28         11         0.23          $0.71
medium-1  deepseek-chat    47        9         15         0.32          $4.10
medium-2  gemini-flash     47       19         18         0.38          $0.04
medium-3  moonshot/k2      47       12         24         0.51          $1.20
```

Three things stand out that you would not get from any benchmark:

1. **Real cost-per-shipped-PR**, not per-token or per-second. A model that emits 5× the reasoning tokens looks 5× worse on a tokens-per-task chart but might still ship 3× the PRs.

2. **Defer rate as a signal**. Some models defer eagerly on hard specs; others plow ahead and produce broken PRs. The defer rate per dispatch tells you which mode each model is in.

3. **Model-x-spec-shape pairings** emerge over time. One model wins consistently on single-file MCP tool implementations; another wins on multi-file refactors; a third wins on REQUIRED-FIX tightening. Static benchmarks rarely separate these.

You stop arguing about which model is "better" and start routing intelligently: send spec-shapes that pair well to the replica that handles them well.
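The columns in the comparison table above can be derived from per-dispatch outcome records. The following is a minimal sketch, assuming a hypothetical `Outcome` record (replica name, result, USD cost per dispatch); the record shape and the `summarize` helper are illustrative and not part of any particular runtime.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Outcome:
    replica: str      # e.g. "medium-0" (hypothetical naming)
    result: str       # "shipped", "deferred", or another state such as "crashed"
    cost_usd: float   # spend for this one dispatch, shipped or not


def summarize(outcomes):
    """Aggregate per-replica dispatch counts, defer rate, and cost per shipped PR."""
    rows = defaultdict(lambda: {"items": 0, "shipped": 0, "deferred": 0, "cost": 0.0})
    for o in outcomes:
        row = rows[o.replica]
        row["items"] += 1
        row["cost"] += o.cost_usd
        if o.result in ("shipped", "deferred"):
            row[o.result] += 1
    summary = {}
    for replica, row in rows.items():
        summary[replica] = {
            "items": row["items"],
            "shipped": row["shipped"],
            "defer_rate": row["deferred"] / row["items"],
            # All spend, including deferred and failed dispatches, is charged
            # against the PRs that actually shipped.
            "usd_per_pr": row["cost"] / row["shipped"] if row["shipped"] else None,
        }
    return summary
```

Charging every dispatch's cost to the shipped PRs is one possible accounting choice; it is what makes an arm that burns tokens and then defers look expensive rather than free.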

## Implementation

### Pool config with per-replica overrides

The pool definition needs a default config plus per-replica diffs. A typical YAML shape:

```yaml
pools:
  builder-medium:
    replicas: 4
    heterogeneous: true            # don't apply env-clobber that would override per-replica
    runtime:
      max_tool_rounds: 16
      task_timeout_seconds: 600
      per_task_usd_cap: 0.50
    # default applies to every replica unless overridden
    provider: xai
    model: grok-fast
    prompt: prompts/CLAUDE-builder-medium.md
    replica_overrides:
      "1":
        provider: deepseek
        model: deepseek-chat
      "2":
        provider: gemini
        model: gemini-2.5-flash
      "3":
        provider: moonshot
        model: kimi-k2
```

The key implementation gotcha: most agent fleets inject `LLM_PROVIDER` and `MODEL` as pod-template environment variables. When you start using per-replica overrides, those env injections will *override* your override unless you defeat them explicitly. A `heterogeneous: true` flag (or equivalent) that suppresses the env injection for the affected pool is worth its own line in the chart.
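One way to picture the override resolution and the env-injection problem is the sketch below. It assumes the pool YAML has already been parsed into a plain dict; `resolve_replica_config` and `env_for_replica` are hypothetical helper names, and the rule "resolved values are applied last for heterogeneous pools" is an assumption about how the flag could be implemented, not a description of an existing chart.

```python
def resolve_replica_config(pool: dict, replica_index: int) -> dict:
    """Return the effective provider/model/prompt for one replica.

    Pool-level values are the defaults; an entry in `replica_overrides`
    keyed by the replica index (as a string, matching the YAML) wins.
    """
    keys = ("provider", "model", "prompt")
    effective = {k: pool[k] for k in keys if k in pool}
    overrides = pool.get("replica_overrides", {}).get(str(replica_index), {})
    effective.update({k: v for k, v in overrides.items() if k in keys})
    return effective


def env_for_replica(pool: dict, replica_index: int, template_env: dict) -> dict:
    """Merge pod-template env with the resolved per-replica config.

    Assumption: for heterogeneous pools the resolved values are written
    last, so template-level LLM_PROVIDER / MODEL cannot clobber them.
    """
    env = dict(template_env)
    if pool.get("heterogeneous"):
        cfg = resolve_replica_config(pool, replica_index)
        env["LLM_PROVIDER"] = cfg["provider"]
        env["MODEL"] = cfg["model"]
    return env
```

Replica 0 has no override entry in the YAML above, so it falls through to the pool defaults; replica 1 keeps the default prompt but swaps provider and model.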

### Dispatcher semantics

The dispatcher needs to do two things differently from a homogeneous pool:

**Fanout on initial assignment.** When the architect (or PM, or whatever your routing layer is) accepts an item for a heterogeneous pool, the item's `assigned_to` becomes a list of every replica, not just one:

```sql
UPDATE backlog SET status='accepted', assigned_to = ARRAY[
  'builder-medium-0', 'builder-medium-1', 'builder-medium-2', 'builder-medium-3'
] WHERE item_id = $1;
```

**Self-exclusion on already-accepted.** Each replica's poll loop should refuse to start working an item that is `in_progress` with a different `accepted_by`. The classic shape:

```sql
SELECT item_id, title, description FROM backlog
WHERE $1 = ANY(assigned_to)
  AND status IN ('accepted', 'in_progress')
  AND (accepted_by IS NULL OR accepted_by = $1);
```

The `accepted_by` field is the lock. Whoever calls `open_pr` first writes themselves into it. The next cycle on any other replica sees `accepted_by != self` and the row drops out of the candidate set.
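To make the filter's behavior concrete, here is an in-memory equivalent of the polling query. This is only an illustration of the predicate, assuming backlog rows are represented as dicts with the same column names; `candidate_items` is a hypothetical helper.

```python
def candidate_items(backlog, replica):
    """In-memory equivalent of the polling query for one replica."""
    return [
        item["item_id"]
        for item in backlog
        if replica in item["assigned_to"]                   # $1 = ANY(assigned_to)
        and item["status"] in ("accepted", "in_progress")
        and item.get("accepted_by") in (None, replica)      # unclaimed, or claimed by me
    ]
```

An item claimed by another replica stays in `in_progress` but no longer appears in this replica's candidate list, which is the stand-down behavior described above.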

### The first-to-ship flip

When a replica successfully calls `open_pr`, the runtime updates the shared row in one atomic statement:

```sql
UPDATE backlog
SET status = 'in_progress',
    accepted_by = $1
WHERE item_id = $2
  AND (accepted_by IS NULL OR accepted_by = $1);
```

`accepted_by IS NULL OR accepted_by = $1` is the optimistic concurrency check. If two replicas race to call `open_pr` at the same time, one wins (its `UPDATE` reports 1 row affected), while the other's statement affects 0 rows and the runtime backs off. The losing replica's PR is still in Gitea, just orphaned; close it as part of cycle cleanup or let it sit.

For most workloads the race is rare enough that you do not need to add explicit locking. Cycles fire every 30s or so; the window where two replicas are both about to call `open_pr` on the same item is small.
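The win/lose outcome of the conditional `UPDATE` can be demonstrated end to end. The sketch below uses Python's built-in `sqlite3` with an in-memory table as a stand-in for the real backlog store (an assumption for the sake of a runnable example); `claim_item` and `demo` are hypothetical names.

```python
import sqlite3


def claim_item(conn: sqlite3.Connection, item_id: str, replica: str) -> bool:
    """Try to become the canonical owner; True only if the row was updated."""
    cur = conn.execute(
        "UPDATE backlog SET status = 'in_progress', accepted_by = ? "
        "WHERE item_id = ? AND (accepted_by IS NULL OR accepted_by = ?)",
        (replica, item_id, replica),
    )
    conn.commit()
    return cur.rowcount == 1


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE backlog (item_id TEXT PRIMARY KEY, status TEXT, accepted_by TEXT)"
    )
    conn.execute("INSERT INTO backlog VALUES ('item-1', 'accepted', NULL)")
    first = claim_item(conn, "item-1", "builder-medium-0")   # wins the lock
    second = claim_item(conn, "item-1", "builder-medium-2")  # 0 rows affected
    owner = conn.execute(
        "SELECT accepted_by FROM backlog WHERE item_id = 'item-1'"
    ).fetchone()[0]
    return first, second, owner
```

Because the condition also accepts `accepted_by = $1`, a winner that retries the same statement still succeeds, so the claim is safe to repeat.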

## What goes wrong

Three failure modes show up often enough to design around.

**All arms defer.** Some specs are genuinely too hard, ambiguous, or blocked. When every replica defers the same item, the item rotates back to the architect for triage — split the spec, add detail, or close as won't-fix. This is signal, not noise: if a heterogeneous pool can't ship a spec, the spec is the problem.

**One arm dominates so completely that data collection stops.** If `replica-0` ships 95% of items, you stop learning about the other replicas. Two responses, both valid: (a) accept that you have found the right model and collapse the pool to homogeneous, saving the cost of the losing arms; (b) deliberately route some classes of items only to non-winners to keep the comparison alive. Option (a) is right for cost-sensitive teams; (b) is right for teams that want continuous re-evaluation as models evolve.

**One arm crashes silently.** A model upgrade, an API change, or a config regression can take an arm from "deferring sometimes" to "never producing a PR ever". The defer-rate dashboard makes this visible, but only if someone actually watches it. A daily alert on "any replica with 0 ships in 24h" catches the silent-stop class.
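The "0 ships in 24h" check is simple enough to sketch. The version below assumes ship events are available as `(replica, shipped_at)` pairs; `silent_replicas` is a hypothetical helper, and how the result reaches an alerting system is left out.

```python
from datetime import datetime, timedelta


def silent_replicas(replicas, ship_events, now, window=timedelta(hours=24)):
    """Return replicas with no shipped PR inside the trailing window.

    `ship_events` is an iterable of (replica, shipped_at) pairs.
    """
    cutoff = now - window
    recent = {replica for replica, shipped_at in ship_events if shipped_at >= cutoff}
    return sorted(r for r in replicas if r not in recent)
```

On a day when nothing was dispatched, every replica will show up as silent, so a real alert would likely also check that the pool received work in the window.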

## Operational tips

A few things that turn out to matter more than they sound like they will.

**Calibrate the cost rate card per model.** Vendor pricing changes; some models bill reasoning tokens as completion (full output rate), others bill them at a discount, others don't expose them at all. Without a calibrated rate card per model, your $/PR numbers are decorative. Update the card monthly or whenever a vendor announces pricing changes.
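A rate card that handles the reasoning-token difference might look like the sketch below. The `Rate` fields and the `reasoning_multiplier` knob are assumptions made for illustration, and any numbers you plug in should come from the vendor's current price sheet, not from this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rate:
    input_per_mtok: float              # USD per million prompt tokens
    output_per_mtok: float             # USD per million completion tokens
    reasoning_multiplier: float = 1.0  # 1.0 = billed as output; below 1.0 = discounted


def task_cost_usd(rate: Rate, input_tokens: int, output_tokens: int,
                  reasoning_tokens: int = 0) -> float:
    """Cost of one task, with reasoning tokens priced relative to output tokens."""
    reasoning_rate = rate.output_per_mtok * rate.reasoning_multiplier
    return (
        input_tokens * rate.input_per_mtok
        + output_tokens * rate.output_per_mtok
        + reasoning_tokens * reasoning_rate
    ) / 1_000_000
```

For a model that does not expose reasoning tokens at all, pass `reasoning_tokens=0` and rely on the billed completion count.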

**Pin model tags.** `gpt-5` and `gpt-5-2026-04-01` may not be the same model in two weeks. Pin specific tags in the pool config; treat a model upgrade the same way you treat any other config change (commit, dispatch a few canary items, watch).

**Tag PRs with replica + model.** Every PR opened by a heterogeneous pool should carry metadata in the body or labels: `builder-medium-2 (gemini-2.5-flash)`. This makes the analysis cheap — `SELECT model, AVG(merge_time), AVG(required_fix_count) FROM prs GROUP BY model` writes itself when the metadata is there.
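Keeping the tag in one fixed format makes it trivially parseable later. A minimal sketch, assuming the `replica (model)` shape shown above; `format_pr_tag` and `parse_pr_tag` are hypothetical helpers.

```python
import re

# Matches e.g. "builder-medium-2 (gemini-2.5-flash)".
TAG_RE = re.compile(r"^(?P<replica>[\w-]+) \((?P<model>[^)]+)\)$")


def format_pr_tag(replica: str, model: str) -> str:
    """Build the metadata string to place in a PR body or label."""
    return f"{replica} ({model})"


def parse_pr_tag(tag: str):
    """Return (replica, model), or None if the tag is not in the expected shape."""
    match = TAG_RE.match(tag.strip())
    return (match["replica"], match["model"]) if match else None
```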

**Track defers as first-class outcomes, not failures.** A defer is the agent declining to ship rather than shipping a broken PR. That is a feature. If you count defers as failures, you'll discourage the behavior and end up with more REQUIRED-FIX cycles.

**Be careful about prompt diffs.** If `replica-1` has a different prompt as well as a different model, you are testing both at once. Sometimes that's the point (each model has slightly different prompt ergonomics). Sometimes it muddles the comparison. Decide explicitly which you are doing.

## When this pattern doesn't pay off

The heterogeneous pool is not free. Each additional replica costs you cycles, tokens, and infrastructure. The pattern only wins when:

- **Backlog volume is high enough** that every arm accumulates meaningful outcomes. With 4 replicas and 12 items per day, every arm is dispatched all 12 items, and an even split would give each arm about 3 first-to-ship wins per day, which is fine. With 4 replicas and 2 items per day, you have no signal.
- **The agent role is well-defined** and the same spec is genuinely runnable by multiple models. A pool that includes a model the others can't compete against (because the spec format relies on capabilities only one supports) is just a homogeneous pool wearing a costume.
- **You actually look at the data**. The pattern produces a continuous dataset; if nobody reads the dashboard or rotates the routing decision, you're paying for parallel work and learning nothing.

For low-volume backlogs, a periodic sandbox A/B (fire the same spec at multiple models in a one-off comparison) gets you most of the value at less cost. For very tight-budget teams, single-replica pools are more cost-effective. The heterogeneous pool is right when you have enough work to amortize the comparison cost and a real reason to know which model handles your real workload best.

## A worked rollout

A reasonable adoption path for a team that has a homogeneous pool today:

1. **Pick the most cost-sensitive pool first** (usually a "medium" or "tier-2" pool where unit cost matters most). Heavy/expensive pools tend to have less variance and less benefit from comparison.
2. **Add the `heterogeneous: true` flag** to defeat env-injection of the pool's default provider/model. Verify the per-replica override is taking effect (`kubectl exec replica-1 -- env | grep MODEL`).
3. **Start with 2 replicas** (your current model + one candidate). Run for a week. Look at the dashboard.
4. **Expand to 3-4 replicas** once you trust the data. Each new replica should test a hypothesis ("does deepseek win on multi-file refactors?"), not just be a new model.
5. **Collapse when you have an answer**. After a month or two you will have either confirmed your current model or found a winner. Collapse back to homogeneous (saving infrastructure cost) and start the cycle again with the next pool.

The point is not to run heterogeneous pools forever. The point is to convert "I think model-X might be better" into "model-X ships 1.4× the PRs at 0.6× the cost for our actual workload" — and then to act on the answer.