We Let AI Write a Third of Our Code. Here's the Review Process That Kept Us Sane.
Rule Zero: The Human Who Merges It Owns It
The most important change was cultural, not technical. Whoever opens the PR is accountable for every line as if they typed it. "The model wrote it" is not a defense in a postmortem. This one norm ended the skim-and-approve reflex, because now skimming was your name on the incident.
Build an Automated Floor Before You Open the Tap
AI raises the volume of code hitting review. If human review is your only filter, reviewers start rubber-stamping under the load. So we put a deterministic gate before any AI-drafted change reaches a person:
- [ ] type-checks / compiles
- [ ] linter clean
- [ ] static analysis (SAST) finds no known-vuln patterns
- [ ] no secrets introduced
- [ ] tests present and non-trivial
- [ ] coverage does not drop
None of this is AI-specific, which is the point. The floor has to be solid enough to absorb more code without more human hours.
Watch for the Failure Modes Assistants Over-Produce
Generated code fails in characteristic ways, and knowing them makes review faster:
- Mishandled edge cases (empty collections, timezones, integer truncation) that the happy path never exercises
- Hallucinated or outdated API calls that sound plausible
- Security anti-patterns like string-concatenated SQL that models reproduce from their training data
We keep a short reviewer checklist of exactly these. Choosing which assistants and scanners to standardize on was its own project; if you are early in that, it is worth surveying the current AI software development tools rather than defaulting to whatever is bundled with your IDE.
Generate Tests for Human Code, Not the Other Way Around
Test generation was our highest-leverage use, with one caveat about the direction of trust. Generating tests for existing, human-written code is great: the code is the trusted artifact and the tests are scaffolding. But when the model writes both the implementation and its tests, the tests tend to encode the implementation's bugs as "expected."
So the intended behavior is always asserted by a human who understands the requirement:
def test_discount_never_exceeds_cap():
# Business rule: discount capped at 30%, regardless of input.
assert apply_discount(price=100, pct=50) == 70 # capped, not 50
assert apply_discount(price=0, pct=30) == 0 # no negative totals
Measure Delivery, Not Typing
The trap is celebrating "lines generated" or "PRs opened." Those are inputs. We watch change-failure rate, time-to-restore, and defect-escape rate. When generation sped up but change-failure rate ticked up, that was the signal we had shifted work from writing to debugging, and debugging is the expensive end.
The Takeaway
More AI in your pipeline is fine, even great, as long as your review gates, test discipline, and accountability are strong enough that the extra volume makes you faster without making you fragile. The teams that win are not the ones generating the most code. They are the ones who treat generation as cheap and ownership as the real work.
If your team is trying to formalize this at scale, it is essentially the operating model of any serious generative AI software development company: move fast on generation, stay strict on verification.
What does your AI code-review process look like? I am collecting patterns in the comments.
Comments
No comments yet. Start the discussion.