DEV Community

Our AI autopilot merged nothing for 3 days - the culprit was a review 'freshness' check

Background: the wait_review step

Our workflow engine drives each task through fixed steps like implementcreate_prwait_ciwait_reviewmerge_pr. We learned early on that letting the AI decide "what to do next" makes operations impossible to reason about, so transitions are hard-coded in the engine. wait_review is a rendezvous point: it polls GitHub PR reviews and advances once approvals are in.

Inside it, there was this check:

# simplified
fresh_reviews = [r for r in pr_reviews if r.submitted_at > workflow.review_initiated_at]
if any(r.state == "APPROVED" for r in fresh_reviews):
    advance_to_merge()

Only look at reviews newer than review_initiated_at. Seems reasonable.

Why the freshness check existed

It was added to prevent a real incident:

  • A reviewer APPROVES
  • The agent pushes follow-up commits
  • The reviewer submits CHANGES_REQUESTED
  • The poll picks up the stale APPROVE and merges anyway

Merging on an outdated approval after changes were requested is clearly bad, so we added "only trust reviews newer than when we started waiting." At the time, this was the correct fix.

Re-arming kills standing approvals

The problem was when review_initiated_at gets updated. The engine has a recovery path for Spot instance reclamation and process restarts: it picks up interrupted workflows and re-enters the current step. Re-entering wait_review re-armed review_initiated_at to the current time - regardless of whether any new commits existed.

Which makes this timeline possible:

  • 10:00 - Reviewer APPROVES (submitted_at = 10:00)
  • 10:30 - Spot reclamation → workflow recovery re-enters the step → review_initiated_at re-armed to 10:30
  • 10:31 - Poll: submitted_at(10:00) > review_initiated_at(10:30) is false → "no new reviews"

Since the reviewer has already approved, no new review will ever arrive → the workflow waits forever. The approval still exists on GitHub, plain as day - a standing approval. But the filter's reference time moved past it, so the engine forever concludes "no review yet." A perfect deadlock.

The nasty part: it's probabilistic. If recovery re-entry happens before the approval, nothing goes wrong. Only workflows where re-entry lands after the approval silently freeze. On Spot-heavy infrastructure, re-entries are frequent enough that these accumulated over days, and only surfaced as "zero dones."

How we actually diagnosed it

Getting from "Autopilot is down" to the root cause came down to one habit: don't look at the endpoint, look at the distribution.

  • Watching only the done count (the endpoint), everything looks uniformly stuck
  • Aggregating in-flight workflows by step showed 28 abnormally parked at wait_review
  • Cross-referencing their PRs: all approved, CI green, unmerged

At that point "the engine can't see approvals" is established, and the rest is just reading the poll code. Counting where the bodies pile up beats hunting for a dead component - because there was no dead component. A live process faithfully executing a wrong rule will never show up on a health check.

The fix: evaluate a snapshot, not events

The essence of the fix fits in one sentence: instead of asking "did a new review arrive?", compute "what is the effective review state right now?" on every poll.

# keep only each reviewer's latest review
latest_by_reviewer = {}
for r in sorted(pr_reviews, key=lambda r: r.submitted_at):
    latest_by_reviewer[r.user.login] = r
effective_states = [r.state for r in latest_by_reviewer.values()]
if "CHANGES_REQUESTED" in effective_states:
    hold()
elif "APPROVED" in effective_states:
    advance_to_merge()

This is literally the semantics GitHub's own UI uses: the review panel shows each reviewer's latest state, and when the approval happened plays no role in the effective verdict. Which lands on an unglamorous but solid conclusion: if you gate on an external system's state, adopt that system's own semantics.

What about the original incident (merging on a stale approval after changes were requested)? Per-reviewer-latest handles it for free: the reviewer who requested changes is in CHANGES_REQUESTED state, so an older APPROVE can never win. The freshness check survives only in its legitimate, narrow role - debouncing re-review after the same reviewer requested changes and new commits landed.

After deploying, the wait_review backlog dropped from 28 to 8, and 15 tasks completed in 25 minutes - three days of approved PRs draining the moment the predicate was fixed.

Wave two: failures must propagate

We thought that was the end. The next day, the same symptom came back - different cause: dependency chains. Some tasks had been marked failed at wait_review during the bug window, and their downstream tasks sat blocked on those dependencies. The auto-recovery job only did "move blocked back to pending," while the scheduler's start condition remained "all dependencies done" - so tasks with a failed ancestor bounced between pending and blocked forever. A no-op retry loop.

The fix was to go fail-fast: when a dependency fails, don't leave descendants ambiguously blocked - explicitly mark them FAILED and propagate it downstream. A visible failure gets fixed with one action ("retry the failed ancestor"). An invisible blocked task never gets retried, because nobody knows it's waiting.

Takeaways

These generalize to any system that polls external state at a rendezvous point, AI agents or not:

  • Evaluate external state as a snapshot, not as events. Gating on "anything newer than time T?" means one mistake in managing T permanently drops facts that are already true (standing approvals). Recomputing "the effective state right now" is idempotent and recovery-proof.

  • Re-arming a timestamp means erasing every fact before it. If retry/recovery/re-entry paths casually set *_initiated_at = now(), everything established before that instant vanishes behind the filter. Re-arm only when the premise actually changed - e.g., new commits landed.

  • Yesterday's safeguard is today's prime suspect. The freshness check was a correct response to a real incident - its scope was just too broad. When you add a defense, also write down what stops working if this defense misfires; you'll check it first next time something stalls.

  • Diagnose "it's stuck" by step-level backlog distribution, not endpoint metrics. All-green health checks can't detect a live process that's wrong on purpose. Counting how many items are parked at each rendezvous point finds it in minutes.

  • Propagate failures. A dependency failure left as "blocked" is an unbounded silent wait. Fail-fast propagation turns recovery into a single retry.

If your autopilot includes auto-merge, "three days of zero dones" isn't an inconvenience - it's a production outage for your development line. Don't trust the green health checks; put rendezvous-point backlogs on your dashboard. That's the monitoring this incident permanently added to ours.

Comments

No comments yet. Start the discussion.