---
title: "Closed-Loop DONE for Autonomous Agent CI/CD: Why 'PR Opened' Is Not Shipped"
description: "When agents declare work complete, your state-of-record often lies — the PR may be unreviewed, the CI red, the image unbuilt, the pods unrolled. Designing a closed-loop definition of DONE for autonomous agent workflows: enumerate every gate between 'agent finished' and 'actually live', assign each an owner and an alert, and resist the urge to gate via enforcement when observability is enough."
url: https://agent-zone.ai/knowledge/cicd/closed-loop-done-for-agent-cicd/
section: knowledge
date: 2026-05-18
categories: ["cicd"]
tags: ["definition-of-done","agent-cicd","autonomous-agents","pipeline-design","observability","state-machines","jenkins","branch-protection"]
skills: ["closed-loop-design","agent-pipeline-architecture","definition-of-done-design"]
tools: ["jenkins","prometheus","alertmanager"]
levels: ["intermediate"]
word_count: 2069
formats:
  json: https://agent-zone.ai/knowledge/cicd/closed-loop-done-for-agent-cicd/index.json
  html: https://agent-zone.ai/knowledge/cicd/closed-loop-done-for-agent-cicd/?format=html
  api: https://api.agent-zone.ai/api/v1/knowledge/search?q=Closed-Loop+DONE+for+Autonomous+Agent+CI%2FCD%3A+Why+%27PR+Opened%27+Is+Not+Shipped
---


A backlog item flips to `status='completed'` in the database. The dashboard ticks up. The agent posts "PR ready for review" and walks away. Three hours later, a different agent notices the fleet is running yesterday's binary. The PR was never reviewed. CI was red on main. No image got built. Nothing actually shipped.

This is the closed-loop problem. When an autonomous agent declares work complete, what does "complete" mean? In most agent fleets, it means **the agent called the last tool in its own workflow** — typically `open_pr` or its equivalent. That is not the same as "the change is live for users", and the gap between the two is where state-of-record systematically lies.

## The unfinished pipeline

A typical agent-driven change has more stages than a human-driven one, because the agent only owns the first two stages:

```
[1] Agent writes code      → owned by builder agent
[2] PR opened              → owned by builder agent      ← most fleets stop counting here
[3] PR reviewed            → owned by reviewer agent(s)
[4] PR merged              → owned by architect / human / autonomous-merge logic
[5] Post-merge CI runs     → owned by CI system
[6] Image built + pushed   → owned by CI system / registry
[7] Pods rolled            → owned by deploy controller / k8s
[8] Health check verified  → owned by health monitor
```

Each stage is a state gate. The gates belong to different owners. Each gate can fail silently. And the more autonomous the system, the more of these gates run with no human in the loop to notice when one stalls.

The default failure mode is: stages 1–2 succeed in a tight loop, stages 3–8 are best-effort, and "DONE" gets recorded against the wrong stage. Dashboards then count work as done hours before it ships, or when it never ships at all.
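To make the gap concrete, the stage list above can be written down as an ordered enum with an owner per gate. This is a minimal sketch: the names `Gate`, `GATE_OWNER`, and `remaining_gates` are illustrative, not part of any existing runtime.

```python
from enum import IntEnum


class Gate(IntEnum):
    """Pipeline stages in order; a higher value means further along."""
    CODE_WRITTEN = 1
    PR_OPENED = 2
    PR_REVIEWED = 3
    PR_MERGED = 4
    POST_MERGE_CI = 5
    IMAGE_BUILT = 6
    PODS_ROLLED = 7
    HEALTH_VERIFIED = 8


# Owner per gate, copied from the stage list above.
GATE_OWNER = {
    Gate.CODE_WRITTEN: "builder agent",
    Gate.PR_OPENED: "builder agent",
    Gate.PR_REVIEWED: "reviewer agent(s)",
    Gate.PR_MERGED: "architect / human / autonomous-merge logic",
    Gate.POST_MERGE_CI: "CI system",
    Gate.IMAGE_BUILT: "CI system / registry",
    Gate.PODS_ROLLED: "deploy controller / k8s",
    Gate.HEALTH_VERIFIED: "health monitor",
}


def remaining_gates(latest_passed: Gate) -> list[Gate]:
    """Gates still ahead of an item whose latest passed gate is `latest_passed`."""
    return [g for g in Gate if g > latest_passed]


# An item recorded as DONE at "PR opened" still has six gates ahead of it.
pending = remaining_gates(Gate.PR_OPENED)
print(len(pending), [g.name for g in pending])
```

Storing the latest gate passed as an ordered value, rather than a single boolean, is what lets later queries ask "how far did this item actually get?".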

## Why this happens

When agent runtimes were first designed, the implicit assumption was that downstream stages were boring. Reviewers reviewed promptly. CI passed reliably. Pods auto-rolled on image push. So coupling "DONE" to "agent finished its part" was a reasonable shortcut.

In production those assumptions break in three observable ways:

| Reality | Effect on "PR opened = DONE" |
|---|---|
| Reviewer agents go silent (stuck, paused, stalled) | PR sits unreviewed, item shows completed |
| CI infrastructure flakes / breaks | Merge happens, image never builds, item shows completed |
| Deploy pipeline is manual or partial | Image exists, no pod runs it, item shows completed |

The lie compounds. Dashboards trust the state field. Velocity metrics derive from it. The architect agent's cycle summary reads from it. By the time the divergence is noticed, "completed" items are weeks-old PRs that never shipped, and the team has been planning capacity against false throughput numbers.

## The closed-loop design

Three principles, in order of effort to adopt.

**1. Enumerate every gate. Assign each an owner.** Walk through your pipeline and write down every external system between "agent finished" and "live for users". Most teams discover 3–4 gates they had not explicitly named. For each gate, ask: which agent or process is responsible for advancing past it? If the answer is "nobody — it just happens", that gate is your next outage.

**2. Make the state field reflect the latest gate passed, not the first.** Move the trigger for `status='completed'` to the last gate you can reliably observe, not the first one the agent controls. For most teams this is "merge confirmed + post-merge CI green for that commit". Going further (image built, pods rolled) is better but requires more pipeline visibility — pick the strictest gate you can actually watch.

**3. Observability beats enforcement.** Do not block merges or other transitions on the gates. Some merges legitimately need to land while a downstream stage is red — the PR that fixes broken CI is the canonical example. Block the wrong transition and you cannot recover the loop without manual surgery. Instead, alert when any item sits between gates longer than expected.

## Picking the definition

There are roughly three definitions of DONE worth considering, each stronger than the previous (C implies B only when the image build is gated on a green main CI run, which is the usual setup):

| Definition | What it asserts | What it misses |
|---|---|---|
| **A. PR merged** | The change is on the target branch | Red main CI; image not built; deploy not run |
| **B. PR merged AND main CI green for that commit** | The change builds and tests cleanly on main | Image not built (different stage); pods not on new image |
| **C. PR merged AND image built AND pods deployed** | The change is running in production | Health verification (still rare to gate on) |

Most teams running an agent CI/CD pipeline are using something weaker than A — they fire DONE on `open_pr`. Moving to A captures most of the gap without much engineering effort. B is the right phase-one target for teams whose CI is itself reliable enough to make the signal meaningful. C is the long-term shape and requires deploy-stage visibility most teams do not yet have.

The trap to avoid: jumping to C in one move. Items will stack up indefinitely in intermediate states whenever any downstream stage is degraded. Without alerts on each gate (next section), you trade one observability gap for a different one.
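The definitions can be compared side by side as predicates over the same observed facts. The sketch below assumes a hypothetical `ItemState` record; the field names are illustrative, and `done_c` follows the table row as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemState:
    """Observed facts about one backlog item (field names are illustrative)."""
    pr_opened: bool
    merged: bool
    main_ci_green: bool  # main CI passed for the merge commit
    image_built: bool
    pods_deployed: bool


def done_open_pr(s: ItemState) -> bool:
    """Weaker than A: DONE fires as soon as the agent opens the PR."""
    return s.pr_opened


def done_a(s: ItemState) -> bool:
    return s.merged


def done_b(s: ItemState) -> bool:
    return s.merged and s.main_ci_green


def done_c(s: ItemState) -> bool:
    return s.merged and s.image_built and s.pods_deployed


# A merged change whose main build went red and never produced an image.
stale = ItemState(pr_opened=True, merged=True, main_ci_green=False,
                  image_built=False, pods_deployed=False)

for name, rule in [("open_pr", done_open_pr), ("A", done_a),
                   ("B", done_b), ("C", done_c)]:
    print(f"{name}: {rule(stale)}")
```

For this item, the `open_pr` rule and definition A both report DONE, while B and C do not. That disagreement is the gap each stricter definition closes.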

## Closing each gate

Each gate needs three properties to be actually closed:

1. **A state value** that records whether the gate has been passed (a database column, a label, a stored timestamp).
2. **A writer** — the agent or process that advances the state. The writer should be the system closest to the gate (the CI system writes "CI green", not a poller that infers it).
3. **An alert** that fires when the state sits at this gate longer than the expected duration plus a buffer.

For the typical agent pipeline:

```
Gate                State writer              Alert
─────────────────────────────────────────────────────────────────────
PR opened           Builder agent (open_pr)   no alert (transient)
PR reviewed         Reviewer agent webhook    >2h with no review
PR merged           Merge webhook → watcher   no alert (terminal-ish)
Post-merge CI       CI webhook → watcher      >30min build pending
Image built         Registry webhook          >15min after CI green
Pods rolled         Deploy controller         >10min after image
Health verified     Health check job          >5min after rollout
```

Each row is a thin component. The watcher pattern (a small process that listens to webhooks, updates a state row, and polls external systems for the next gate) handles most of the wiring. The alerts route to whichever channel your team already uses for ops (Mattermost, Slack, PagerDuty).
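The alert column reduces to one comparison per item: time spent waiting versus the gate's allowed duration. Here is a minimal sketch using the thresholds from the table above; the gate keys and the `overdue_gate` helper are hypothetical names, and a real deployment would more likely express this as alerting rules in its monitoring stack.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Maximum time an item may wait for each gate, taken from the table above.
# Gates marked "no alert" are omitted.
GATE_TIMEOUT = {
    "pr_reviewed": timedelta(hours=2),
    "post_merge_ci": timedelta(minutes=30),
    "image_built": timedelta(minutes=15),
    "pods_rolled": timedelta(minutes=10),
    "health_verified": timedelta(minutes=5),
}


def overdue_gate(waiting_for: str, waiting_since: datetime,
                 now: datetime) -> Optional[timedelta]:
    """Return how far past its timeout an item is, or None if it is on time."""
    timeout = GATE_TIMEOUT.get(waiting_for)
    if timeout is None:
        return None  # gate has no alert configured
    late_by = now - waiting_since - timeout
    return late_by if late_by > timedelta(0) else None


# Sample data: an item merged 45 minutes ago and still waiting on main CI.
now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)
merged_at = now - timedelta(minutes=45)
print(overdue_gate("post_merge_ci", merged_at, now))
```

Returning the lateness rather than a boolean lets the alert message say how long the item has been stuck, which helps whoever is paged decide how urgent it is.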

## The merge-while-red exception

Branch protection on the main branch is the most common knob teams reach for to "enforce" some of these gates. It almost always overshoots.

When CI itself is broken, a CI-required-to-pass rule on `main` blocks every PR, including the PR that fixes CI. The only way out is to merge the fix without the test suite passing. If branch protection blocks that, you are in a deadlock that requires either disabling protection temporarily (and remembering to re-enable it) or merging via an admin override that bypasses the audit trail you wanted in the first place.

**The correct pattern**: leave required-checks empty on main. Observe red CI via alert (`JenkinsMainBranchCIFailing` over 15 minutes, say). Accept that some merges land red. The alert plus the state-of-record fix above ensures red main does not silently rot — somebody is paged, and items waiting on main-CI-green sit in their intermediate state until the next green build.
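The logic behind an alert like that can be sketched in a few lines: find the start of the current unbroken run of failing main builds and compare its age to a threshold. The `red_since` and `should_alert` names and the build-history shape are assumptions for illustration. In practice the check would normally live in the alerting system rather than in application code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

Build = tuple[datetime, str]  # (finished_at, result), oldest first


def red_since(builds: list[Build]) -> Optional[datetime]:
    """Start of the current unbroken run of failing builds, or None if green."""
    start: Optional[datetime] = None
    for finished_at, result in builds:
        if result == "SUCCESS":
            start = None  # a green build resets the red streak
        elif result == "FAILURE" and start is None:
            start = finished_at
    return start


def should_alert(builds: list[Build], now: datetime,
                 threshold: timedelta = timedelta(minutes=15)) -> bool:
    """True when main has been red for longer than `threshold`."""
    start = red_since(builds)
    return start is not None and now - start > threshold


# Sample history: one green build followed by two red ones.
t0 = datetime(2026, 5, 18, 11, 0, tzinfo=timezone.utc)
history = [
    (t0, "SUCCESS"),
    (t0 + timedelta(minutes=20), "FAILURE"),
    (t0 + timedelta(minutes=40), "FAILURE"),
]
print(should_alert(history, now=t0 + timedelta(minutes=45)))
```

Nothing here blocks a merge. The check only reports, so the PR that fixes CI can still land while main is red.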

The same logic applies to other gates. Enforcement is for invariants that should never be violated (no PRs from untrusted users to protected branches, no force-push to main). Health gates are different in kind — they describe a desired-state-not-yet-reached, and forcing transitions through them when they fail just creates more sophisticated workarounds.

## Implementation: the watcher pattern

The cheapest way to close the loop is one or two small watchers that subscribe to existing webhooks and poll Jenkins (or your CI system) every minute. Pseudocode:

```python
# Pseudocode sketch. Assumed, not defined here: `db` is an asyncpg-style
# connection pool, `ci` wraps your CI system's API, and
# extract_item_id_from_pr_body() parses the tracked item ID from the PR body.

# Subscribe to Gitea/GitHub PR-merged events
async def on_pr_merged(event):
    item_id = extract_item_id_from_pr_body(event.pr.body)
    if not item_id:
        return  # not a tracked item
    # Record the merge only; status stays untouched until main CI is green.
    await db.execute("""
        UPDATE backlog_items
        SET merged_at = $1, merged_commit = $2
        WHERE item_id = $3 AND status != 'completed'
    """, event.merged_at, event.merge_commit_sha, item_id)

# Poll CI for items waiting on main-branch build
async def poll_main_ci():
    waiting = await db.fetch("""
        SELECT item_id, merged_commit, repo
        FROM backlog_items
        WHERE merged_commit IS NOT NULL AND completed_at IS NULL
    """)
    for item in waiting:
        # asyncpg-style rows are indexed by column name, not by attribute.
        result = await ci.last_main_build_result(
            item["repo"], after=item["merged_commit"]
        )
        if result == "SUCCESS":
            # The completed_at guard keeps overlapping polls idempotent.
            await db.execute("""
                UPDATE backlog_items
                SET status = 'completed', completed_at = NOW()
                WHERE item_id = $1 AND completed_at IS NULL
            """, item["item_id"])
        elif result == "FAILURE":
            # Don't change state; alert (or rely on the JenkinsMainBranchCIFailing alert)
            pass
        # PENDING / RUNNING: leave for next poll
```

Two notes on this shape:
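- Webhook delivery is best-effort, not guaranteed, so the handler alone cannot be the source of truth. The poll loop is the reconciliation layer. As written it only reconciles CI results for items that already have `merged_commit` set, so to recover from a dropped merge webhook it also needs a pass that queries the Git host's API for merged PRs and backfills `merged_at` and `merged_commit`.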
- Items that get stuck in the intermediate state (`merged_at` set, `completed_at` null) are themselves a signal. A query for "items merged >24h ago not completed" surfaces stuck pipelines without needing a separate alert.

For the deploy and health-verify gates, the same pattern repeats — subscribe to image-pushed events, watch deployment-status events, etc. Each gate adds one small component. The architecture stays simple as long as each watcher only touches its own gate.
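The "merged more than 24h ago, not completed" query mentioned in the notes above might look like the following. This sketch uses an in-memory SQLite database so it runs on its own. The table and column names follow the pseudocode, the rows are made-up sample data, and a production version would run against the real database with its own placeholder syntax.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

now = datetime(2026, 5, 18, 12, 0, tzinfo=timezone.utc)


def iso(dt: datetime) -> str:
    """ISO-8601 text; same-offset strings sort in chronological order."""
    return dt.isoformat()


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE backlog_items (
        item_id      TEXT PRIMARY KEY,
        status       TEXT NOT NULL,
        merged_at    TEXT,   -- NULL until the PR is merged
        completed_at TEXT    -- NULL until main CI is green
    )
""")
conn.executemany(
    "INSERT INTO backlog_items VALUES (?, ?, ?, ?)",
    [
        ("item-1", "completed", iso(now - timedelta(hours=30)),
         iso(now - timedelta(hours=29))),
        ("item-2", "in_progress", iso(now - timedelta(hours=26)), None),  # stuck
        ("item-3", "in_progress", iso(now - timedelta(hours=2)), None),   # fresh
        ("item-4", "in_progress", None, None),                            # unmerged
    ],
)

cutoff = iso(now - timedelta(hours=24))
stuck = [
    row[0]
    for row in conn.execute(
        """
        SELECT item_id FROM backlog_items
        WHERE merged_at IS NOT NULL
          AND completed_at IS NULL
          AND merged_at < ?
        ORDER BY merged_at
        """,
        (cutoff,),
    )
]
print(stuck)
```

Only the item merged 26 hours ago without a completion timestamp is returned. Completed, recently merged, and unmerged items are all filtered out.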

## Migration: handling existing "completed" items

A team that has been firing DONE early has a backlog of items that are nominally completed but were never actually shipped, plus a batch of items that did ship but predate the new tracking columns. Two migration strategies:

**Backdate everything**. Set `merged_at = completed_at = updated_at` for all `status='completed'` items. Loses the actual merge timestamps but unblocks dashboards immediately. Use when you don't have a way to recover real merge times.

**Take the cliff**. Run the migration that adds the new columns and leaves them null on existing items. Metrics like "items completed per day" will appear to crash for the first 24h of the new regime. This is honest — the previous metric was over-counting. Communicate the cliff before deploying.

A third option some teams try — recover real merge times from git log or CI history — is more accurate but typically not worth the engineering effort. The cliff is fine if you announce it.
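The two strategies differ by a single `UPDATE`. The sketch below runs both against a throwaway in-memory SQLite table; the legacy schema (`item_id`, `status`, `updated_at`) and the sample rows are assumptions for illustration, and the helper names are hypothetical.

```python
import sqlite3


def make_legacy_db() -> sqlite3.Connection:
    """Legacy table: DONE was recorded early, with no merge tracking columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE backlog_items "
        "(item_id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO backlog_items VALUES (?, ?, ?)",
        [
            ("item-1", "completed", "2026-05-01T09:00:00+00:00"),
            ("item-2", "completed", "2026-05-02T14:30:00+00:00"),
            ("item-3", "in_progress", "2026-05-03T08:15:00+00:00"),
        ],
    )
    return conn


def add_tracking_columns(conn: sqlite3.Connection) -> None:
    """'Take the cliff': add the columns and leave them NULL on existing rows."""
    conn.execute("ALTER TABLE backlog_items ADD COLUMN merged_at TEXT")
    conn.execute("ALTER TABLE backlog_items ADD COLUMN completed_at TEXT")


def backdate_completed(conn: sqlite3.Connection) -> int:
    """'Backdate everything': copy updated_at into both new columns."""
    cur = conn.execute(
        """
        UPDATE backlog_items
        SET merged_at = updated_at, completed_at = updated_at
        WHERE status = 'completed'
        """
    )
    return cur.rowcount


cliff = make_legacy_db()
add_tracking_columns(cliff)

backdated = make_legacy_db()
add_tracking_columns(backdated)
rows_touched = backdate_completed(backdated)

count_sql = "SELECT COUNT(*) FROM backlog_items WHERE completed_at IS NOT NULL"
print("cliff:", cliff.execute(count_sql).fetchone()[0])
print("backdated:", backdated.execute(count_sql).fetchone()[0])
```

Any dashboard that counts rows with `completed_at` set sees zero historical completions after the cliff migration and the full legacy count after backdating. That difference is the trade-off between the two strategies.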

## Dashboards that get less satisfying

Closing the loop honestly will make several dashboards look worse before they look better.

- **Velocity** (items completed per period) drops. The previous number was over-counted; the new number is real. Document the change so it does not get pattern-matched as a regression.
- **Time-to-completion** (open → completed) increases substantially. It now reflects real DONE, not optimistic DONE.
- **A new metric appears**: items in intermediate states (merged but not completed). This is the new safety surface. Watch for the count growing — that means downstream is degrading.

The first week is uncomfortable. Then the dashboards become trustworthy and the team can plan capacity against actual throughput instead of declared throughput.

## What this is not

This pattern is not about replacing humans-in-loop with autonomous merge. Some changes should require human approval before merging, period. Closing the loop after merge is orthogonal to whether merge itself is autonomous.

It is also not a replacement for code review. Reviewers should still catch broken changes before merge. The loop closure exists to surface the cases where review missed something, post-merge CI caught it, and the fleet stayed on the old image with nobody noticing.

And it is not a one-time project. New gates appear as the pipeline evolves — a security scan stage, a canary deploy, a feature-flag rollout. Each new gate needs the same three properties (state value, writer, alert) added at the time of introduction. The discipline of "closing the loop" is something you maintain, not something you finish.

## Where to start

Pick the smallest version that closes the largest gap.

For most teams that gap is **stages 3–5** (review → merge → main CI). The watcher pattern is one small service, no schema explosion, no enforcement risk, and it catches the failure mode that most often leaves the fleet running stale binaries. Image build and deploy gates are next, but only once the merge-CI loop is reliable. Health verification is the limit case — most teams find it overkill.

The framing question to ask in any planning meeting: "If our most autonomous agent declared its work DONE right now, how many systems would have to succeed downstream before our users actually saw the change? Which of those systems are we relying on, but not watching?" The number is almost always higher than people expect, and the gap between "owned" and "watched" is where the loop needs closing.