DEV Community 2h ago

I built a multi-agent loop where an adversarial Claude reviewer reads your actual codebase before approving plans

Large language models are surprisingly optimistic reviewers. Ask an LLM to review an implementation plan and it will often approve one that contains:

Non-existent file paths
Incorrect function signatures
Missing edge cases
Broken assumptions about the codebase
Incomplete testing strategies

The problem is simple: the model is reasoning from its training data and the conversation context, not from your actual repository. I wanted something different. I wanted a reviewer whose default assumption is that the plan is wrong, and whose job is to prove it. So I built agent-plan-review-loop, an open-source multi-agent orchestration system that repeatedly challenges implementation plans until they survive adversarial review.

The Core Idea

Most AI review workflows look like this:

Author generates a plan
Reviewer checks the plan
Reviewer approves the plan

The problem is that both agents often share the same context and reasoning chain. My approach intentionally breaks that connection. Every artifact is stored as a markdown file inside the repository:

Plan
Review
Questions
Decisions
Diffs

Each agent runs as a completely fresh process using Claude Code CLI. The reviewer has no access to the author's reasoning. It only sees:

The implementation plan
The actual repository
Its instructions

This forces the reviewer to evaluate the plan on its own merits rather than continuing the author's thought process. In practice, this catches a surprising number of mistakes.
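A minimal sketch of that isolation, not the project's actual script: the reviewer's entire input is assembled from files on disk and handed to a brand-new process. The function names, file arguments, prompt layout, and the `AGENT_CMD` override are all hypothetical; `claude -p` is Claude Code's non-interactive print mode.

```shell
# Sketch only: function names, file layout, and AGENT_CMD are assumptions.

# Build the reviewer's whole input from files. Nothing from the author's
# session is passed along, only the instructions and the plan itself.
build_reviewer_prompt() (
  instructions_file=$1
  plan_file=$2
  cat "$instructions_file" || exit 1
  printf '\n--- PLAN UNDER REVIEW ---\n'
  cat "$plan_file"
)

# Run the reviewer as a fresh process inside the repository so it can
# inspect the real files. Pass absolute paths: the function changes
# directory before writing the review.
run_reviewer() (
  repo=$1
  instructions_file=$2
  plan_file=$3
  review_file=$4
  prompt=$(build_reviewer_prompt "$instructions_file" "$plan_file") || exit 1
  cd "$repo" || exit 1
  # AGENT_CMD is a hypothetical override; it defaults to `claude -p`.
  ${AGENT_CMD:-claude -p} "$prompt" > "$review_file"
)
```

Because the only channel between the two agents is the plan file, swapping `AGENT_CMD` for a stub is enough to dry-run the wiring without calling a model.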

The Workflow

The system runs an Author → Reviewer loop until the plan is approved or the tier's iteration limit is reached.

Task
  ↓
Classifier
  ↓
Author
  ↓
Reviewer
  ↓
CHANGES_REQUESTED?
  ↓ Yes → Author revises
  ↓ No
APPROVED
  ↓
Coder implements
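
The loop in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of plan-loop.sh: `author_step` and `reviewer_step` are hypothetical callbacks the caller defines, and the `STATUS:` line in the review file is an assumed format modeled on the status strings used in this post.

```shell
# Sketch only: author_step / reviewer_step are hypothetical callbacks, and
# the "STATUS:" line in the review file is an assumed format.

# Read the verdict from a review file. Anything other than an explicit
# APPROVED counts as CHANGES_REQUESTED, matching the skeptical default.
review_status() (
  status=$(sed -n 's/^STATUS:[[:space:]]*\([A-Z_]*\).*/\1/p' "$1" | head -n 1)
  case $status in
    APPROVED) echo APPROVED ;;
    *)        echo CHANGES_REQUESTED ;;
  esac
)

# Alternate author and reviewer until approval or the iteration cap.
# Exit status 0 means approved; 1 means a step failed or the cap was hit.
plan_loop() (
  plan_file=$1
  review_file=$2
  max_iterations=$3
  i=1
  while [ "$i" -le "$max_iterations" ]; do
    author_step "$plan_file" "$review_file" "$i" || exit 1
    reviewer_step "$plan_file" "$review_file" "$i" || exit 1
    if [ "$(review_status "$review_file")" = APPROVED ]; then
      exit 0
    fi
    i=$((i + 1))
  done
  exit 1
)
```

Treating a missing or malformed status as CHANGES_REQUESTED keeps a crashed reviewer from being mistaken for an approval.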

The reviewer is intentionally adversarial. Its primary instruction is:

You are a SKEPTICAL senior REVIEWER. Find why this plan will FAIL. Do not praise it. Default to CHANGES_REQUESTED; approve only if genuinely sound.

Instead of asking "what's good about this plan?", the reviewer asks:

Which assumptions are wrong?
Which files don't actually exist?
Which edge cases were missed?
Which APIs are being used incorrectly?
Which tests are missing?

The result is far more useful feedback than generic AI approval.
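
To make one of those questions concrete, here is a deterministic illustration of the "which files don't actually exist?" check. The reviewer in this project is an LLM reading the repository, so this is not how it works internally; the function name and the convention that plans wrap paths in backticks are assumptions made for the example.

```shell
# Sketch only: assumes the plan writes file paths in backticks, e.g. `src/a.js`.
# Prints every backticked, slash-containing token that does not exist in the repo.
missing_plan_paths() (
  plan_file=$1
  repo=$2
  grep -o '`[^`]*`' "$plan_file" | tr -d '`' | sort -u |
  while IFS= read -r path; do
    case $path in
      *' '*) ;;                                  # commands, not paths
      */*)   [ -e "$repo/$path" ] || printf '%s\n' "$path" ;;
    esac
  done
)
```

Any output from `missing_plan_paths plan.md "$REPO"` is a path the plan refers to that the repository does not contain.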

Complexity-Aware Model Routing

One challenge with agent systems is cost. Running the most expensive model for every task quickly becomes impractical. To solve that, I added a lightweight classification step using Haiku. Each task is categorized before planning begins:

Tier	Task Type	Author	Reviewer	Max Iterations
T0	Text, Config, CSS	Sonnet	Opus	3
T1	Small Feature	Sonnet	Opus	3
T2	Complex Refactor	Opus	Sonnet	6

This allows the system to reserve expensive reasoning for genuinely difficult work. Most routine tasks never need a full Opus planning cycle.
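
The routing table translates directly into a lookup. A small sketch, assuming the classifier emits the tier label as plain text; `route_tier` is a hypothetical name, and the model names are the short labels from the table rather than exact CLI model identifiers.

```shell
# Map a tier label to "author reviewer max_iterations", mirroring the table.
# Sketch only: route_tier is a hypothetical helper name.
route_tier() {
  case $1 in
    T0|T1) echo "sonnet opus 3" ;;
    T2)    echo "opus sonnet 6" ;;
    *)     echo "unknown tier: $1" >&2; return 1 ;;
  esac
}
```

For example, `set -- $(route_tier T2)` leaves the author model in `$1`, the reviewer model in `$2`, and the iteration cap in `$3`.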

Handling Decisions AI Can't Make

One thing I strongly wanted to avoid was AI guessing business requirements. When the Author encounters a genuine product decision, it stops and asks. For example:

STATUS: NEEDS_ANSWERS
Q1: Should exports include archived items? A:
Q2: CSV, XLSX, or both? A:

The workflow exits. A human answers the questions. The process resumes exactly where it left off. This prevents the system from inventing requirements simply to keep moving forward.
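
One way the resume check could work, sketched under the assumption that the questions file keeps the one-line `Qn: ... A:` shape shown above. Both function names are hypothetical.

```shell
# Sketch only: assumes one question per line in the "Qn: ... A: answer" shape.

# Print the IDs of questions whose "A:" is still empty.
unanswered_questions() (
  sed -n 's/^\(Q[0-9][0-9]*\):.* A:[[:space:]]*$/\1/p' "$1"
)

# Succeed only when the file no longer blocks the workflow: either it was
# never in NEEDS_ANSWERS, or every question now has an answer.
can_resume() (
  questions_file=$1
  grep -q '^STATUS: NEEDS_ANSWERS' "$questions_file" || exit 0
  [ -z "$(unanswered_questions "$questions_file")" ]
)
```

Keeping the answers in the same file as the questions means the resumed Author reads the human's decisions the same way it reads every other artifact.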

Isolated Implementation

Once a plan is approved, a separate Coder agent performs the implementation. The implementation never touches the developer's active working tree. Instead, it creates an isolated Git worktree:

REPO="$PWD" bash code-run.sh TASK-42

The result is:

Dedicated branch
Isolated workspace
Generated diff
Human review before merge

This makes experimentation much safer than allowing AI to modify a live working directory.
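
The underlying Git mechanics look roughly like this. It is a sketch, not the contents of code-run.sh: the `agent/` branch prefix, the default worktree location, and both function names are assumptions for illustration.

```shell
# Sketch only: the "agent/" prefix and directory layout are assumptions.

# Create a dedicated branch and worktree for a task without touching the
# caller's checkout. Prints the worktree path on stdout.
create_task_worktree() (
  repo=$(cd "$1" && pwd) || exit 1
  task_id=$2
  worktree_dir=${3:-$repo/../worktrees/$task_id}
  # Git's progress messages go to stderr so stdout carries only the path.
  git -C "$repo" worktree add -b "agent/$task_id" "$worktree_dir" >&2 || exit 1
  printf '%s\n' "$worktree_dir"
)

# Produce the diff a human reviews before merging: everything the task
# branch added since it forked from the base (default: current HEAD).
task_diff() (
  repo=$1
  task_id=$2
  base=${3:-HEAD}
  git -C "$repo" diff "$base...agent/$task_id"
)
```

When the review is done, `git worktree remove <path>` discards the workspace while the branch and its commits stay available for merging.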

Pluggable Deployment

Deployment is intentionally simple. Users define their own validation and deployment commands:

GATE_CMD='bash laravel-gate.sh' \
DEPLOY_CMD='bash ship.sh' \
bash deploy-run.sh TASK-42

If validation fails, the merge is automatically rolled back. The framework doesn't assume anything about your stack.
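
A rough sketch of the gate-then-rollback idea, assuming the task lives on an `agent/<task>` branch and that rollback means resetting to the pre-merge commit. Neither detail is confirmed by the project; `deploy_with_gate` is a hypothetical name, while `GATE_CMD` and `DEPLOY_CMD` are the variables shown above.

```shell
# Sketch only: the "agent/" prefix and reset-based rollback are assumptions,
# not necessarily what deploy-run.sh does.
deploy_with_gate() (
  repo=$1
  task_id=$2
  : "${GATE_CMD:?set GATE_CMD to your validation command}"
  cd "$repo" || exit 1
  before=$(git rev-parse HEAD) || exit 1          # remember the pre-merge commit
  if ! git merge --no-ff --no-edit "agent/$task_id" >&2; then
    git merge --abort 2>/dev/null                 # leave no half-finished merge
    exit 1
  fi
  if ! sh -c "$GATE_CMD" >&2; then
    git reset --hard "$before" >&2                # gate failed: undo the merge
    exit 1
  fi
  sh -c "${DEPLOY_CMD:-true}"                     # gate passed: ship it
)
```

Because the gate and deploy steps are plain command strings, the same function works whether validation is a test suite, a linter, or a stack-specific script.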

The Unexpected Discovery

The most important architectural decision turned out to be the simplest one: files are a better memory system than conversations. When the reviewer starts from scratch and reads markdown artifacts instead of inheriting conversation history, it becomes dramatically more critical. It behaves less like a second opinion from the same person and more like an independent engineer joining the review process for the first time. That independence is exactly what makes the feedback valuable.

Mobile Workflow

I also added an optional Telegram bot. It allows:

Answering review questions
Providing steering notes
Managing multiple tickets
Monitoring progress remotely

This turned out to be surprisingly useful when I'm away from my laptop.

Example Usage

Generate and review a plan:

REPO="$PWD" bash plan-loop.sh \
  TASK-1 \
  "add CSV export to reports page"

Implement an approved plan:

REPO="$PWD" bash code-run.sh TASK-1

Open Source

GitHub: https://github.com/execute25/agent-plan-review-loop

The project requires:

Claude Code CLI
Bash
Git

I'm particularly interested in feedback from people building AI coding agents, autonomous development workflows, or review systems. What would you change in this architecture?
