The Missing Half of Trust in AI Coding: Verifying AI-Generated Code
AI code generation has scaled. Verification hasn't - and that gap is becoming technical debt.
The application worked. That was the dangerous part. After 47 days, 1,384 commits, around 250 spec files in Obsidian, more than 470 test files, and an estimated 100M+ generated output tokens, I had a working system. The features existed. The tests ran. The pull requests were merged. And I still wanted to throw the codebase away.
Not because AI couldn't write code. Quite the opposite. It could write too much, too quickly, and with too much confidence.
For about half a year, I had been testing one question that I think many engineering teams are now running into: How far can you push autonomous AI coding runs on complex codebases without losing control?
My first answer was sobering: Very far - but not in a way I could trust.
At some point, implementation was no longer the bottleneck. The AI was writing code faster than I could meaningfully verify it. And that created a new kind of debt. Not technical debt in the traditional sense. Something closer to what I now think of as validation debt. Every line of code that is generated faster than it is proven becomes an obligation. And that obligation comes back. During review. During debugging. During the next refactor. Or at the moment when you realize: the application works, but nobody wants to touch the codebase anymore.
AI coding didn't improve my implementation loop. It broke my verification loop.
When I review code written by colleagues, I never read it in isolation. I know roughly how they work. I know the codebase. I know which patterns we like, which shortcuts are acceptable, where someone probably made a reasonable decision, and where I should look more closely.
With AI, that changed. The AI often found solutions that worked. But they didn't look like solutions an experienced team would have built in that codebase. Sometimes they were clever. Sometimes unnecessarily complicated. Sometimes locally correct and architecturally wrong. Every review started from zero. I didn't just have to check whether the code produced the right result. I had to check whether I understood why it was built that way. Whether it fit the architecture. Whether the tests actually proved anything. Whether a fix only smoothed over a symptom. Whether a new helper would become a problem three weeks later.
The AI had overtaken my implementation speed. At first, that felt like a breakthrough. Then it started to feel like losing control.
In the end, I was looking at a system that formally "worked" but was practically no longer maintainable. After 25 merged pull requests, the application ran - but the code looked so bad that I didn't want to read it anymore, let alone debug it. The problem wasn't that AI writes bad code. The problem was that I had allowed it to work without equally fast verification.
The old development principle I temporarily forgot because of AI
Eventually, I remembered something every developer learns the hard way: There is no implementation without verification.
That is not an abstract principle. It is the core loop of software development. We build something. We check it. We adjust. We check again. That is how trust is created. For code we have written a hundred times before, that loop becomes almost invisible. We know what works. We know the failure modes. We catch many errors while typing.
With AI, something different happens. We hand off the implementation but keep verification for ourselves. That creates an asymmetric workflow: AI-speed implementation. Human-speed verification. That does not scale.
This realization is not new. Around the same time, I kept seeing the same idea show up across the major AI labs and engineering leaders. Anthropic puts it bluntly in its Claude Code best practices: "If you can't verify it, don't ship it." And Addy Osmani, who leads Developer Experience for Chrome at Google, makes it personal: "Your job is to deliver code you have proven to work."
The real insight was not: I need better prompts. The insight was: What AI implements, AI also has to prove. Not at the end of a long run. Not as a summary after the fact. Not as friendly prose saying "I verified this." But during the run. Step by step. Acceptance criterion by acceptance criterion. With evidence that does not come from the same agent that wrote the code.
From that point on, I stopped looking for better prompts. I started looking for a workflow where verification was not optional.
Why prompts are not enough
My first approach was obvious: I wrote better specs. In Obsidian, I moved acceptance criteria directly into the spec files. Every task got a checklist. Similar to project management, but much closer to the code.
That helped immediately. For the first time, I could see per task what was actually supposed to be fulfilled. The AI worked in a more structured way. It corrected itself more often. Reviews became easier. Commits became cleaner. Prompts became fewer.
But the core problem remained. Because an instruction in a prompt is not a guarantee. A skill can say: "Please verify every criterion." But the agent can skip one. It can lose track during long runs. It can claim something was checked. It can point to a test that does not actually prove the criterion. At the end of a long session, it can produce a nice-sounding summary that feels true but leaves no reliable trail.
That is the difference between a rule and a boundary. A rule says: Please do it this way. A boundary says: You cannot move forward until this is done. For AI coding, that difference matters. If verification lives only in the prompt, it is part of the agent's behavior. If verification lives in the system, it becomes part of the infrastructure.
Verification has to move into the data model
The next version of my workflow was based on a simple assumption: An acceptance criterion must not be allowed to move to done unless evidence is attached. Not as a comment. Not as a Markdown flourish. Not as "looks good." But as structured state.
That meant the spec itself needed a data model. A criterion has a status. A criterion has evidence. A criterion belongs to a phase. And any write operation that violates those rules gets rejected. This changed verification from a request to the agent into a condition enforced by the system.
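As a minimal sketch of what such a data model could look like (this is not anchored's actual implementation; the names `Criterion`, `Evidence`, `Status`, and `RejectedWrite` are hypothetical and chosen only for illustration), the rule "no done without evidence" can sit inside the write operation itself:

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    DONE = "done"


class RejectedWrite(Exception):
    """Raised when a write would violate the spec's rules."""


@dataclass
class Evidence:
    kind: str       # hypothetical categories, e.g. "test_run" or "prose"
    reference: str  # e.g. a test ID or a path to a log file


@dataclass
class Criterion:
    phase: str  # every criterion belongs to a phase
    text: str
    status: Status = Status.OPEN
    evidence: list[Evidence] = field(default_factory=list)

    def attach(self, item: Evidence) -> None:
        self.evidence.append(item)

    def mark_done(self) -> None:
        # The boundary: without attached evidence, the write is rejected.
        if not self.evidence:
            raise RejectedWrite(f"no evidence attached: {self.text!r}")
        self.status = Status.DONE
```

The point of the sketch is where the check lives: the agent can call `mark_done()` as often as it likes, but the state only changes once evidence exists.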
Role separation was just as important. The agent that writes the code should not be the only one deciding whether the code has been proven. So an independent instance had to evaluate each criterion:
- What was required?
- What was changed?
- Which test or check proves it?
- Is the evidence sufficient?
- Was a project rule violated?
- Does this step need to go back into the loop?
This is not a classic review at the end. It is verification at the smallest useful boundary: the acceptance criterion.
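The questions above can be sketched as a small validation step that runs separately from the agent that wrote the code. Everything here is an assumption for illustration: the `Submission` and `Verdict` types are hypothetical, and the "is the evidence sufficient?" judgment, which in practice needs an independent evaluator, is replaced by a trivial stand-in check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    criterion: str                     # what was required
    changed_files: tuple[str, ...]     # what was changed
    evidence_ref: str                  # which test or check proves it
    rule_violations: tuple[str, ...] = ()  # project rules that were broken


@dataclass(frozen=True)
class Verdict:
    accepted: bool  # False means the step goes back into the loop
    reason: str


def validate(sub: Submission) -> Verdict:
    """Evaluate one criterion; intended to run outside the build agent."""
    if sub.rule_violations:
        return Verdict(False, "rule violation: " + ", ".join(sub.rule_violations))
    # Stand-in for a real sufficiency judgment: a reference must at least exist.
    if not sub.evidence_ref.strip():
        return Verdict(False, "insufficient evidence")
    return Verdict(True, "evidence sufficient")
```

A rejected verdict carries its reason, so the build agent gets something concrete to correct rather than a bare failure.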
Why I made it file-based
I tried different approaches: databases, project management tools, external stores. In the end, the fastest and most robust model was surprisingly simple: Files directly inside the codebase.
Not because files are glamorous. They are not. But for locally running AI agents, they are almost perfect. They are fast. They are versionable. They live where the code lives. They work without additional infrastructure. They are readable by humans. And they make the AI process itself part of the repository.
That last point mattered. If AI is going to work more deeply inside the codebase, then the code should not be the only thing under version control. The process that led to the code should be versioned as well. Tests, gates, acceptance criteria, commit rules, validation logic - none of that should live only in one person's head or inside a chat history. It belongs in the project.
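To make the file-based idea concrete, here is a minimal sketch that stores spec state as JSON inside the repository. The file layout, the field names, and the choice of JSON are assumptions for illustration; the article does not specify the format anchored uses.

```python
import json
from pathlib import Path


def save_spec(spec: dict, path: Path) -> None:
    """Write the spec as stable, diff-friendly JSON next to the code."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys and fixed indentation keep version-control diffs small.
    text = json.dumps(spec, indent=2, sort_keys=True)
    path.write_text(text + "\n", encoding="utf-8")


def load_spec(path: Path) -> dict:
    """Read the spec back; agent and human read the same file."""
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical spec content; the structure is illustrative only.
example_spec = {
    "task": "password-reset",
    "phases": [
        {
            "name": "phase-1",
            "criteria": [
                {"text": "Reset link expires", "status": "open", "evidence": []}
            ],
        }
    ],
}
```

Because the state is an ordinary file, every change to a criterion shows up in the same commit history as the code it describes.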
The workflow that came out of it
These experiments eventually became the system I now call anchored. anchored is a Claude Code plugin for a workflow I call Spec-Validated Coding. The idea is simple: The spec defines what matters. The system enforces the proof. Every piece of work moves through the same four steps:
1. plan
The task is broken into phases. Not just as a rough plan, but with testable acceptance criteria. The important point is that each criterion has to be written in a way that someone - or something - can later verify whether it was actually fulfilled. If a criterion is not testable, it is not a good criterion.
2. refine
The plan is checked against the real codebase and the project rules. This is one of the steps I underestimated at first. Many problems do not begin during implementation. They begin earlier: unclear requirements, missing edge cases, wrong assumptions about the existing architecture, or acceptance criteria that are too soft. refine forces the plan to become more concrete before a line of code is written. In my rebuild, this step closed gaps multiple times before they turned into code problems - sometimes as simply as: "AUTO-FIXED: missing acceptance criteria." A bug that never gets built never has to be debugged.
3. build
Implementation happens phase by phase. But the agent cannot simply mark a phase as done just because the code compiles or a test is green. Each criterion needs evidence. And that evidence is validated. If the proof is not sufficient, the step is rejected. If a rule is violated, the step is rejected. If a criterion was skipped, the run cannot cleanly move forward. That is the key difference from a skill-based workflow. The agent is not merely asked to validate. It has to.
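A phase gate of this kind can be sketched in a few lines. This is an illustration under assumptions, not the plugin's code: `CriterionState`, `blocking_criteria`, and `close_phase` are hypothetical names, and the two booleans stand in for richer evidence and validation records.

```python
from dataclasses import dataclass


@dataclass
class CriterionState:
    text: str
    has_evidence: bool = False
    validated: bool = False  # set by the validator, not by the build agent


def blocking_criteria(criteria: list[CriterionState]) -> list[str]:
    """Return the criteria that still prevent the phase from closing."""
    return [c.text for c in criteria if not (c.has_evidence and c.validated)]


def close_phase(name: str, criteria: list[CriterionState]) -> str:
    blockers = blocking_criteria(criteria)
    if blockers:
        # A skipped or unproven criterion stops the run here.
        raise RuntimeError(f"phase {name!r} blocked by: {blockers}")
    return f"phase {name!r} closed"
```

Note that a green test alone does not appear anywhere in the gate: only evidence that has been attached and validated counts.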
4. wrap
At the end, the run is summarized. For small tasks, this is a review of the work. For larger efforts, it can be a roll-up across multiple tasks or phases. The important part is that wrap is not where verification is done retroactively. By then, verification has already happened. wrap summarizes what was proven.
The second build
Once anchored existed, I wanted to know whether it would actually make a difference. So I rebuilt the failed project. Not fixed it. Not cleaned it up. Rebuilt it. Same domain. Same general problem. But this time from the first line onward inside the anchored workflow.
I was convinced it would go better. The numbers still surprised me.
| Metric | Old build | New build with anchored |
|---|---|---|
| Criteria with evidence | 0 of 915 | 514 of 514 |
| Evidence quality | Prose, around 17% with a real test | 70% with a real test run |
| Caught by validator | 0 | 24 |
| of which: insufficient evidence | 0 | 16 |
| of which: rule violations | 0 | 8 |
| Manual investigation reports after the fact | around 48 | 0 |
| Manual findings | 687 | 0 |
| Prompts | around 4,700 estimated | 1,474 measured |
| Output tokens | around 126M estimated | 52M measured |
| Duration | 47 days | 23 days |
| Commits | 1,384 | 484 |
Important caveat: this was not a clean A/B test. The second build obviously benefited from experience. I knew the domain better. I knew which architecture decisions had caused problems the first time. I had learned from the failed attempt. So duration, commits, prompts, and tokens are useful context - but not scientific proof.
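For context only, the relative differences in those four context metrics can be worked out from the table. The sketch below copies the table values; the old-build prompt and token counts are estimates, so the resulting percentages are rough and carry the same caveat.

```python
# Values copied from the table above; old-build prompts and tokens are estimates.
OLD = {"prompts": 4700, "output_tokens_millions": 126, "days": 47, "commits": 1384}
NEW = {"prompts": 1474, "output_tokens_millions": 52, "days": 23, "commits": 484}


def relative_reduction(old: float, new: float) -> float:
    """Fraction by which the new value is lower than the old one."""
    return 1 - new / old


for metric in OLD:
    reduction = relative_reduction(OLD[metric], NEW[metric])
    print(f"{metric}: {reduction:.0%} lower")
```

Run as written, this reports roughly 69% fewer prompts, 59% fewer output tokens, 51% fewer days, and 65% fewer commits.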
The truly important line is another one: 514 out of 514 criteria had attached evidence. The old build could not produce that trail at all. In the old project, "verified" was ultimately a feeling. In the new one, it was state in the system.
What actually changed
1. I was no longer the verification loop
This was the biggest difference. Not speed. Not token consumption. Not even code quality. The biggest difference was that the integrity of the process no longer depended on my discipline. In the old build, I was the bottleneck. After each run, I had to review, ask follow-up questions, sharpen prompts, write findings, read reports, and make the agent test things again. In the new build, the validator handled much of that work before I ever looked at the result. If evidence was insufficient, the step was rejected. If a rule was violated, the step was rejected. If an acceptance criterion was not properly fulfilled, it stayed open. Most of the time, I only saw the green result - but with a trail I could inspect. That feels completely different when working with autonomous AI coding runs.
2. Fixing round-trips collapsed
In the old project, I drove the corrections manually. 687 findings in tickets. Around 48 investigation and retry reports. One test scene repeated nine times. That was not "AI-assisted development." That was a human cleaning up after a very fast junior developer. In the new workflow, many of those loops happened inside the system. The validator found the issue, rejected the step, the build agent corrected it, and the validator checked again. Of course, the work does not disappear. But it moves. Away from human follow-up. Toward systemic correction. That was the moment AI coding first started to feel truly autonomous to me. Not because the AI became perfect. But because not every mistake made it all the way to me.
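The reject, correct, and re-check cycle can be sketched as a bounded loop. This is a simplified illustration: `run_until_accepted` is a hypothetical helper, the `build` and `validate` callables stand in for the build agent and the validator, and the attempt limit with escalation to a human is an assumption, not something the article describes.

```python
from typing import Callable, Optional


def run_until_accepted(
    build: Callable[[Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Loop build -> validate until the validator accepts or attempts run out.

    `build` receives the previous rejection reason (None on the first try).
    `validate` returns None to accept, or a rejection reason to send back.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        result = build(feedback)
        feedback = validate(result)
        if feedback is None:
            return result, attempt
    # Only the cases the loop cannot resolve reach a human.
    raise RuntimeError(f"escalate to a human after {max_attempts} attempts: {feedback}")
```

The human still sees failures, but only the ones that survive the loop.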
3. Verification moved forward
Many teams treat verification as something that happens after implementation. Build first. Check later. With AI, that is dangerous, because implementation can create a huge amount of surface area very quickly. If the spec is weak, the AI will confidently build in the wrong direction. That is why refine became almost as important to me as build. The plan is checked against the real codebase. Acceptance criteria are sharpened. Missing criteria are added. Project rules are considered before code exists. That is not just cleaner. It is cheaper. The cheapest bug is not the one a test finds later. The cheapest bug is the one that never gets implemented.
4. The code stayed readable
This surprised me almost more than anything else. I expected anchored to give me better traceability. I expected cleaner reports. I expected less manual review work. I did not expect the codebase itself to look so much better. The rebuild did not feel like a pile of autonomous agent decisions. It felt like a project built by an engineering team with shared principles. Of course, it was not perfect. There were careless mistakes. There were places I would have written differently. There were decisions one could debate.