DEV Community

The Most Dangerous Code in Your Repo Is the Behavior Nobody Can Prove

We spend a lot of time arguing about code quality. Is it clean? Readable? Tested? Too clever? AI-generated? All fair questions, but I think we're missing a more important one: Can the repository actually prove what this behavior is supposed to do?

Not the developer. Not the product manager. Not the person who has been on the team for four years and somehow knows that this weird edge case is intentional. The repository itself. Because when a system gets old enough, "how it works" and "what the repo can prove" slowly become two different things. And that gap is where bugs hide.

Code is not the same as behavior

A repository can show you code, commits, who changed a file and when. But code history is not behavior history. Git can tell you: "This function changed on Tuesday." It usually cannot tell you: "This function handles duplicate webhook delivery because the payment provider retries failed events for 24 hours, and changing this logic without updating the idempotency tests is risky."

That second sentence is the thing teams actually need, and in many codebases it only exists in someone's head. Sometimes it lives in a Slack thread from 2022, or a Jira ticket nobody can find, or a PR comment that made sense at the time. Sometimes it doesn't exist anywhere. The code still runs. The tests still pass. The team still ships. But the repo has lost the reason.

AI makes this worse, but it didn't create the problem

It's easy to blame AI coding tools here. AI generates code that looks clean, tests that pass, and explanations that sound reasonable. But teams had code like this long before AI:

if (user.country === "US" && order.total > 0) {
    enableManualReview = true;
}

Why? Fraud rule? Tax rule? Old business requirement? Temporary workaround that became permanent? Nobody knows.

Now add AI to the picture. A coding agent sees that condition and tries to "simplify" it. A reviewer sees a clean diff. The tests pass because nobody ever wrote a test for the original behavior, and the PR gets merged. Two weeks later, someone asks why orders stopped entering manual review. The behavior was real. The evidence was missing.
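One way to leave that evidence behind is a characterization test that pins the current behavior before anyone touches it. The sketch below is a minimal illustration in plain Node.js, not a recommendation for a specific test framework. `needsManualReview` is a hypothetical wrapper around the condition above, and the cases record only what the code does today, not why.

```javascript
// Hypothetical wrapper around the condition shown earlier; the function name
// is an assumption made for this sketch.
function needsManualReview(user, order) {
  return user.country === "US" && order.total > 0;
}

// Characterization cases: they record what the code does today, so a later
// "simplification" that changes an outcome fails loudly instead of silently.
const cases = [
  { name: "US order, positive total", user: { country: "US" }, order: { total: 25 }, expected: true },
  { name: "US order, zero total", user: { country: "US" }, order: { total: 0 }, expected: false },
  { name: "non-US order, positive total", user: { country: "DE" }, order: { total: 25 }, expected: false },
];

// Returns the names of the cases whose outcome no longer matches.
function runCharacterization() {
  return cases
    .filter((c) => needsManualReview(c.user, c.order) !== c.expected)
    .map((c) => c.name);
}

const failures = runCharacterization();
console.log(failures.length === 0 ? "behavior pinned" : `behavior changed: ${failures.join(", ")}`);
```

A test like this still doesn't say whether the rule is about fraud or tax. It does guarantee that removing the rule becomes a visible decision rather than an accident.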

A green test suite is not proof of understanding

I love tests, but tests can only prove what someone remembered to assert. A test can prove that a handler returns 200. It may not prove that the handler is allowed to retry, that the retry must be idempotent, or that the timeout exists because a mobile client depends on it.

A passing suite often means "the checks we wrote still pass" - not "the behavior users depend on is still protected."

This is why missing evidence deserves to be treated as a signal, not just an inconvenience. If a PR changes payment logic and no payment tests changed, that doesn't automatically mean the PR is bad - but it's a signal. If a feature has five implementations, three naming styles, no docs, and tests that only cover happy paths, the next change to it is more dangerous than it looks.
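The "payment logic changed, no payment tests changed" signal is simple enough to sketch. The function below is an illustration only. The directory layout (`src/payments/`, `test/payments/`) and the function name are assumptions, and a real check would take its input from your CI system or from something like `git diff --name-only`.

```javascript
// Assumed layout for this sketch: source under src/<area>/, tests under test/<area>/.
const AREAS = [
  { name: "payments", source: "src/payments/", tests: "test/payments/" },
  { name: "webhooks", source: "src/webhooks/", tests: "test/webhooks/" },
];

// Returns the areas whose source changed while none of their tests did.
// This is a prompt for a reviewer's question, not a verdict on the PR.
function missingEvidenceSignals(changedFiles) {
  return AREAS.filter((area) => {
    const sourceChanged = changedFiles.some((f) => f.startsWith(area.source));
    const testsChanged = changedFiles.some((f) => f.startsWith(area.tests));
    return sourceChanged && !testsChanged;
  }).map((area) => area.name);
}

// Example PR: payments source changed alone; webhooks changed with its test.
const signals = missingEvidenceSignals([
  "src/payments/refund.js",
  "src/webhooks/handler.js",
  "test/webhooks/handler.test.js",
  "README.md",
]);
console.log(signals); // only "payments" is flagged
```

The check is deliberately crude. It can't tell whether the existing tests already cover the change, which is why it surfaces a question instead of blocking the merge.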

The repo is quietly telling you: "I may work, but I cannot fully explain why." That difference matters more now because code is getting easier to produce. When code becomes cheap, understanding becomes expensive.

The new code review question

Traditional review asks whether the code is correct, readable, secure, and tested. Those questions still matter. But I think every risky PR needs one more: What behavior does this change claim to preserve, and where is the evidence?

Take a PR that touches billing webhooks. The review shouldn't only ask whether the code looks good. It should ask:

  • Does this affect duplicate delivery?
  • Retry behavior?
  • Event ordering?
  • Refunds?

And for each answer - is there evidence, or just confidence? This isn't about adding bureaucracy. It's about refusing to merge changes that look safe only because the repo cannot explain itself.

"Documentation" is not enough

The usual answer is: write better docs. I half agree. Docs are useful when they stay close to the system and dangerous when they drift away from it.

  • A README saying "webhooks are idempotent" is nice.
  • A test proving duplicate delivery is safe is better.
  • A decision record explaining why retries work this way is better still.
  • And a tool that connects the claim to the actual files, tests, routes, and recent diffs is the best version of all.
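To make the second bullet concrete, here is a minimal sketch of a test that proves duplicate delivery is safe. Everything in it is hypothetical: the `createWebhookHandler` factory, the event shape with an `id` field, and the in-memory `Set` that stands in for whatever durable store a real system would need.

```javascript
// Hypothetical handler: applies each event at most once, keyed by event id.
// The in-memory Set is a stand-in; a real system would need a durable store.
function createWebhookHandler(ledger) {
  const seenEventIds = new Set();
  return function handle(event) {
    if (seenEventIds.has(event.id)) {
      // Acknowledge the retry without applying the event a second time.
      return { status: 200, applied: false };
    }
    seenEventIds.add(event.id);
    ledger.push({ orderId: event.orderId, amount: event.amount });
    return { status: 200, applied: true };
  };
}

// The test: deliver the same event twice, as a retrying provider would.
const ledger = [];
const handle = createWebhookHandler(ledger);
const event = { id: "evt_1", orderId: "order_42", amount: 1999 };

const first = handle(event);
const second = handle(event);

// Checking only the status would pass even for a handler that records the
// payment twice. The idempotency claim is proven by the ledger check.
const statusOk = first.status === 200 && second.status === 200;
const appliedOnce = ledger.length === 1;
console.log({ statusOk, appliedOnce });
```

The status check is the kind of assertion most suites already have. The ledger check is the one that turns the README's claim into evidence.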

The future of software maintenance isn't more documentation. It's evidence-backed documentation: here is the claim, here is the evidence, here is what's still uncertain. A repo should be allowed to say "I don't know" - that's much healthier than pretending.

Maybe repos need an "understanding score"

We measure test coverage, build time, bundle size, churn, security findings, dependency risk. We almost never measure whether a repository can explain its own important behavior.

Imagine opening a repo and seeing:

Area               Evidence
Authentication     Strong
Billing webhooks   Partial
File uploads       Weak
Admin permissions  Unclear
Email delivery     No decision record found

Not because the score would be perfect, and not because it replaces human review - but because it points attention at the parts of the system where the confidence is fake. And fake confidence is expensive.
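As a thought experiment, a rating like this could be derived from a few observable facts per area. The sketch below is not how any existing tool computes it. The evidence fields (`tests`, `docs`, `decisionRecord`), the sample numbers, and the thresholds are all assumptions chosen only to show the shape of the idea.

```javascript
// Hypothetical evidence records; in practice these would come from scanning
// the repository rather than being written by hand.
const areas = [
  { name: "Authentication", tests: 12, docs: true, decisionRecord: true },
  { name: "Billing webhooks", tests: 4, docs: true, decisionRecord: false },
  { name: "File uploads", tests: 1, docs: false, decisionRecord: false },
  { name: "Admin permissions", tests: 0, docs: false, decisionRecord: false },
];

// Assumed thresholds: tests plus both kinds of written evidence is "Strong",
// tests plus one kind is "Partial", no evidence at all is "Unclear".
function evidenceLevel(area) {
  const hasTests = area.tests > 0;
  if (!hasTests && !area.docs && !area.decisionRecord) return "Unclear";
  if (hasTests && area.docs && area.decisionRecord) return "Strong";
  if (hasTests && (area.docs || area.decisionRecord)) return "Partial";
  return "Weak";
}

const report = areas.map((area) => ({ area: area.name, evidence: evidenceLevel(area) }));
for (const row of report) {
  console.log(`${row.area.padEnd(18)} ${row.evidence}`);
}
```

Counting tests says nothing about whether they assert the right behavior, so a rating like this is best read as a pointer to where a human should look first.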

The real bottleneck

AI tools will keep improving. They'll write more code, open more PRs, fix more bugs, generate more tests. That's not the problem. The problem is that faster code generation doesn't automatically create better system understanding - a team can ship more code while understanding less of the behavior they're preserving.

The bottleneck isn't syntax, boilerplate, or even review speed. The bottleneck is evidence. Can we prove what this behavior is, why it exists, and that this change doesn't break it? Until repositories can answer those questions, we're relying on memory, folklore, and confidence. And confidence is not evidence.

The most dangerous code in your repo isn't always the ugly code, the old code, or the AI-generated code. Sometimes it's the behavior everybody depends on but nobody can prove. Once you start looking for it, you see it everywhere.

I've started exploring this idea in an open-source CLI tool called DevTime - it scans a repository and generates evidence-backed explanations of how things actually work, linking every claim to real files and tests. If the problem in this post sounds familiar, I'd love your feedback on it.

What's the piece of "everybody knows" behavior in your codebase that nobody can actually prove? Tell me in the comments.
