I built a multi-agent loop where an adversarial Claude reviewer reads your actual codebase before approving plans
Large language models are surprisingly optimistic reviewers. Ask an LLM to review an implementation plan and it will often approve things that are objectively wrong:
- Non-existent file paths
- Incorrect function signatures
- Missing edge cases
- Broken assumptions about the codebase
- Incomplete testing strategies
The problem is simple: the model is reasoning from its training data and the conversation context, not from your actual repository. I wanted something different. I wanted a reviewer whose default assumption is that the plan is wrong, and whose job is to prove it. So I built agent-plan-review-loop, an open-source multi-agent orchestration system that repeatedly challenges implementation plans until they survive adversarial review.
The Core Idea
Most AI review workflows look like this:
- Author generates a plan
- Reviewer checks the plan
- Reviewer approves the plan
The problem is that both agents often share the same context and reasoning chain. My approach intentionally breaks that connection. Every artifact is stored as a markdown file inside the repository:
- Plan
- Review
- Questions
- Decisions
- Diffs
Each agent runs as a completely fresh process using the Claude Code CLI. The reviewer has no access to the author's reasoning. It only sees:
- The implementation plan
- The actual repository
- Its instructions
This forces the reviewer to evaluate the plan on its own merits rather than continuing the author's thought process. In practice, this catches a surprising number of mistakes.
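A minimal sketch of that isolation, not the project's actual script: the reviewer's prompt is assembled only from files on disk and handed to a brand-new `claude -p` process (Claude Code's non-interactive print mode). The function names, the separator line, and the file layout are assumptions made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and file layout are hypothetical.

# Build the reviewer's prompt from two files: its instructions and the plan.
# Nothing from the author's session is included.
build_reviewer_prompt() {
  local instructions_file="$1" plan_file="$2"
  cat "$instructions_file"
  printf '\n--- PLAN UNDER REVIEW ---\n'
  cat "$plan_file"
}

# Start a fresh Claude Code process inside the repository, so the reviewer
# can read real files but inherits no conversation history.
run_fresh_reviewer() {
  local repo="$1" instructions_file="$2" plan_file="$3" review_file="$4"
  local prompt
  prompt="$(build_reviewer_prompt "$instructions_file" "$plan_file")" || return 1
  (cd "$repo" && claude -p "$prompt") > "$review_file"
}
```

Because the prompt is rebuilt from files on every run, the only way information can reach the reviewer is by being written into the repository first.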
The Workflow
The system runs an Author → Reviewer loop until the reviewer approves the plan or the task's iteration limit is reached.
Task
↓
Classifier
↓
Author
↓
Reviewer
↓
CHANGES_REQUESTED?
↓ Yes → Author revises
↓ No
APPROVED
↓
Coder implements
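The loop in the diagram can be sketched roughly as follows. This is an illustration under assumptions, not the contents of plan-loop.sh: it assumes the review file carries a `STATUS:` line, and `author_step` and `reviewer_step` are hypothetical hooks the caller defines.

```bash
#!/usr/bin/env bash
# Sketch only: author_step / reviewer_step are hypothetical hooks supplied by the caller.

# Print the verdict from the first "STATUS: <VALUE>" line of a review file.
read_status() {
  sed -n 's/^STATUS:[[:space:]]*\([A-Z_]*\).*$/\1/p' "$1" | head -n 1
}

# Alternate author and reviewer until approval or until the budget is spent.
# Returns 0 on APPROVED, 1 if max_iterations is reached without approval.
plan_loop() {
  local plan_file="$1" review_file="$2" max_iterations="$3"
  local i
  for ((i = 1; i <= max_iterations; i++)); do
    author_step "$plan_file" "$review_file"    # writes or revises the plan
    reviewer_step "$plan_file" "$review_file"  # writes a review with a STATUS line
    if [ "$(read_status "$review_file")" = "APPROVED" ]; then
      return 0
    fi
  done
  return 1
}
```

On the first pass the review file does not exist yet, so an author hook would need to treat a missing review as "no feedback so far".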
The reviewer is intentionally adversarial. Its primary instruction is:
You are a SKEPTICAL senior REVIEWER. Find why this plan will FAIL. Do not praise it. Default to CHANGES_REQUESTED; approve only if genuinely sound.
Instead of asking "what's good about this plan?", the reviewer asks:
- Which assumptions are wrong?
- Which files don't actually exist?
- Which edge cases were missed?
- Which APIs are being used incorrectly?
- Which tests are missing?
The result is far more useful feedback than generic AI approval.
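Some of these questions are mechanical enough to check with a script. As an illustration only (the reviewer itself is an LLM reading the repository, not this function), the sketch below pulls backtick-quoted paths out of a plan and lists the ones that do not exist. It assumes plans wrap file paths in backticks and that a path contains a slash and a file extension.

```bash
#!/usr/bin/env bash
# Sketch only: assumes the plan wraps file paths in backticks, e.g. `src/app.py`.

# Print every backtick-quoted path in the plan that does not exist in the repo.
missing_paths() {
  local repo="$1" plan_file="$2" candidate
  grep -o '`[^` ]*/[^` ]*\.[A-Za-z0-9]*`' "$plan_file" | tr -d '`' | sort -u |
    while IFS= read -r candidate; do
      [ -e "$repo/$candidate" ] || printf '%s\n' "$candidate"
    done
}
```

A check like this only covers the "files that don't exist" question; wrong assumptions and missed edge cases still need a reviewer that reads the code.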
Complexity-Aware Model Routing
One challenge with agent systems is cost. Running the most expensive model for every task quickly becomes impractical. To solve that, I added a lightweight classification step using Haiku. Each task is categorized before planning begins:
| Tier | Task Type | Author | Reviewer | Max Iterations |
|---|---|---|---|---|
| T0 | Text, Config, CSS | Sonnet | Opus | 3 |
| T1 | Small Feature | Sonnet | Opus | 3 |
| T2 | Complex Refactor | Opus | Sonnet | 6 |
This allows the system to reserve expensive reasoning for genuinely difficult work. Most routine tasks never need a full Opus planning cycle.
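The table above translates naturally into a lookup. The sketch below mirrors its rows; the function name and the lowercase model labels are hypothetical, and the classifier call that produces the tier is left out.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the routing table; the function name is hypothetical.

# Map a tier to "<author-model> <reviewer-model> <max-iterations>".
route_tier() {
  case "$1" in
    T0 | T1) echo "sonnet opus 3" ;;
    T2)      echo "opus sonnet 6" ;;
    *)       echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Example: unpack the routing for a task the classifier labeled T2.
read -r author reviewer max_iterations <<< "$(route_tier T2)"
echo "author=$author reviewer=$reviewer max_iterations=$max_iterations"
# prints: author=opus reviewer=sonnet max_iterations=6
```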
Handling Decisions AI Can't Make
One thing I strongly wanted to avoid was AI guessing business requirements. When the Author encounters a genuine product decision, it stops and asks. For example:
STATUS: NEEDS_ANSWERS
Q1: Should exports include archived items? A:
Q2: CSV, XLSX, or both? A:
The workflow exits. A human answers the questions. The process resumes exactly where it left off. This prevents the system from inventing requirements simply to keep moving forward.
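Deciding whether it is safe to resume can be a simple text check. The sketch below assumes the `Qn: ... A:` line format shown above and treats a question as unanswered when nothing follows its `A:`; the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: assumes the "Qn: ... A:" line format shown above.

# Count questions whose "A:" is still empty.
unanswered_count() {
  grep -c -E '^Q[0-9]+:.*[[:space:]]A:[[:space:]]*$' "$1"
}

# Succeed only when the file is marked NEEDS_ANSWERS and every question has an answer.
ready_to_resume() {
  local questions_file="$1"
  grep -q '^STATUS: NEEDS_ANSWERS' "$questions_file" || return 1
  [ "$(unanswered_count "$questions_file")" -eq 0 ]
}
```

Keeping the answers in the same file as the questions means a resumed run needs no state beyond what is already in the repository.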
Isolated Implementation
Once a plan is approved, a separate Coder agent implements it. The Coder never touches the developer's active working tree; it creates an isolated Git worktree and works there:
REPO="$PWD" bash code-run.sh TASK-42
The result is:
- Dedicated branch
- Isolated workspace
- Generated diff
- Human review before merge
This makes experimentation much safer than allowing AI to modify a live working directory.
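For readers unfamiliar with worktrees, here is a minimal sketch of the underlying Git mechanics. It is not the body of code-run.sh: the `agent/<task>` branch naming, the directory layout, and the function names are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: branch and directory naming are hypothetical.

# Create a dedicated branch and worktree for a task, leaving the main
# checkout untouched. Prints the worktree directory.
create_task_worktree() {
  local repo="$1" task_id="$2" worktree_root="$3"
  local branch="agent/$task_id"
  local worktree_dir="$worktree_root/$task_id"
  mkdir -p "$worktree_root" || return 1
  git -C "$repo" worktree add -b "$branch" "$worktree_dir" >/dev/null 2>&1 || return 1
  printf '%s\n' "$worktree_dir"
}

# Produce the diff a human reviews before merging: everything committed on
# the task branch since it diverged from the currently checked-out branch.
task_diff() {
  local repo="$1" task_id="$2"
  git -C "$repo" diff "HEAD...agent/$task_id"
}
```

Files the agent writes land in the worktree directory, so an abandoned attempt can be discarded with `git worktree remove` without affecting the developer's checkout.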
Pluggable Deployment
Deployment is intentionally simple. Users define their own validation and deployment commands:
GATE_CMD='bash laravel-gate.sh' \
DEPLOY_CMD='bash ship.sh' \
bash deploy-run.sh TASK-42
If validation fails, the merge is automatically rolled back. The framework doesn't assume anything about your stack.
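The rollback behaviour can be pictured with plain Git commands. This is a sketch under assumptions, not the contents of deploy-run.sh: it assumes the gate runs after the merge, and the function name and return codes are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: merge a branch, run the gate, and undo the merge if the gate fails.
# Returns 0 if the gate passed, 1 if the merge was rolled back, 2 if the merge itself failed.
merge_with_gate() {
  local repo="$1" branch="$2" gate_cmd="$3"
  local before
  before="$(git -C "$repo" rev-parse HEAD)" || return 2

  # Force a merge commit so there is always a single commit to undo.
  if ! git -C "$repo" merge --no-ff --no-edit "$branch" >/dev/null 2>&1; then
    git -C "$repo" merge --abort >/dev/null 2>&1
    return 2
  fi

  # Run the user-supplied validation command from the repository root.
  if (cd "$repo" && bash -c "$gate_cmd"); then
    return 0
  fi

  # Gate failed: move the branch back to where it was before the merge.
  git -C "$repo" reset --hard "$before" >/dev/null
  return 1
}
```

A deploy command would run only after this function returns 0, for example `merge_with_gate "$PWD" agent/TASK-42 "$GATE_CMD" && bash -c "$DEPLOY_CMD"`.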
The Unexpected Discovery
The most important architectural decision turned out to be the simplest one: files are a better memory system than conversations. When the reviewer starts from scratch and reads markdown artifacts instead of inheriting conversation history, it becomes far more critical. It behaves less like a second opinion from the same person and more like an independent engineer joining the review process for the first time. That independence is exactly what makes the feedback valuable.
Mobile Workflow
I also added an optional Telegram bot. It allows:
- Answering review questions
- Providing steering notes
- Managing multiple tickets
- Monitoring progress remotely
This turned out to be surprisingly useful when I'm away from my laptop.
Example Usage
Generate and review a plan:
REPO="$PWD" bash plan-loop.sh \
TASK-1 \
"add CSV export to reports page"
Implement an approved plan:
REPO="$PWD" bash code-run.sh TASK-1
Open Source
GitHub: https://github.com/execute25/agent-plan-review-loop
The project requires:
- Claude Code CLI
- Bash
- Git
I'm particularly interested in feedback from people building AI coding agents, autonomous development workflows, or review systems. What would you change in this architecture?