Green Tests Don't Mean Better Software
DEV Community

You spent the week on a refactor because it was supposed to make the next change cheaper. The tests go green, the PR merges, and that's the last anyone thinks about it. Nobody comes back to check whether the next change actually got cheaper. The green bar said the code conforms to its spec, everyone read that as done, and the reason you did the work in the first place went unmeasured.

What green actually checks

Your CI is green. The test runs assert response.status == 200, and it passes. It does not assert that the caching layer you just shipped cut p99 latency, or that the reworded onboarding step lifted activation, or that the prompt change made the agent follow instructions more often. The green bar answered a question you never asked.

What did that green actually tell you? That the change conforms to the spec - the inputs map to the asserted outputs, the contract holds, nothing you wrote a test for regressed. That is correctness. It is real and it is necessary. It told you nothing about whether the system got better. You shipped because the bar was green, and the bar was never measuring the thing you shipped the change for.

Those latency, activation, and adherence numbers are effects. CI does not measure effects. It measures conformance, and then we read the green bar as if it answered the other question too. The silent assumption is that "passes" and "better" are the same axis - that a change which is correct is, by that fact, an improvement. They are not the same axis. A perfectly correct change can make the system worse, and a perfectly correct change can leave every metric exactly where it was; the suite goes green for both. Green is necessary for "better" and nowhere near sufficient for it, and almost no team instruments the difference.
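To make the gap concrete, here is a minimal sketch of a conformance check and an effect check run side by side. The handler, the latency samples, and the nearest-rank p99 helper are all invented for illustration; nothing here comes from a real system. The numbers are chosen so that typical latency drops while the tail gets worse.

```python
def handle_request(path):
    # Hypothetical handler: it always honours the contract.
    return {"status": 200, "body": f"ok:{path}"}

def p99(samples):
    # Nearest-rank 99th percentile, using integer arithmetic for the rank.
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]

# Invented latency samples in milliseconds, before and after a caching change.
before = [100] * 98 + [300, 320]
after = [60] * 98 + [310, 330]

# Spec axis: does the response conform to the contract?
conforms = handle_request("/home")["status"] == 200

# Effect axis: did the change move the metric it was shipped to move?
improved = p99(after) < p99(before)

print(f"conforms={conforms} improved={improved}")
```

With these samples the conformance check passes and the p99 comparison does not, which is the case the suite cannot see.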

Two disciplines, pitched as rivals

The industry already named both halves of this. It named them as rivals.

Spec-Driven Development says: write the specification first, then build to it, then prove the build conforms. The test suite is the executable form of the spec. Correctness is the whole game. SDD is disciplined, auditable, and it is what most engineering orgs actually run. As a flow it terminates the moment the bar goes green:

  • Write spec
  • Build to spec
  • Prove conformance
  • Stop at green

Notice where it stops. It stops at correct. Nothing in that loop asks whether the shipped change improved anything.

Hypothesis-Driven Development asks the opposite question. Thoughtworks framed it as a triple: we believe X will result in outcome Y; we will know we are right when we see measurable signal Z. The unit of progress is not a passing build - it is validated learning. You predict an effect, you ship, you measure, and the measurement tells you whether you were right. PMI's expectation-management literature says the same thing from a different room: an expectation is a managed object with an owner and a due date, tracked and surfaced early rather than discovered late. As a flow it begins where SDD ends:

  • Make prediction
  • Ship
  • Measure
  • Grade at measurement

These get pitched as competing philosophies - spec-first versus hypothesis-first, prove-correct versus prove-better, pick a camp. Look at the two flows again. One ends at correct. The other starts at predicted and ends at better. They answer two different questions, and the pick-a-camp framing assumes they answer one.

Thoughtworks HDD and PMI expectation-management are not novel claims I am making here; they are prior art, decades of it, and I am citing them as validation. What neither of them did was notice that the two flows were never aimed at the same question.

The orthogonality move

Two different questions means two different axes. You cannot answer one by measuring the other - and a single pass/fail bar is built to answer only the first.

Correct sits on one axis; better sits on the one perpendicular to it. One axis is spec-conformance: does the change do what it was specified to do? Pass or fail, and pytest already answers it. The other axis is effect-verdict: did the change move the metric it was supposed to move? Confirmed, refuted, or not yet known - and nothing in your pipeline answers it today.

Lay them on a 2×2:

                       effect confirmed       effect refuted / unmeasured
spec passes (green)    shipped and works      green but no better
spec fails (red)       blocked (correctly)    blocked (correctly)

Three of those quadrants are familiar. Red blocks the merge, whichever way the effect would have gone. Green-and-confirmed is the win you wanted. The quadrant nobody names is the top-right: green but no better. The change is correct, it conforms to spec, and it shipped - and it did not improve the system, or you never checked, which from the system's point of view is the same thing. That quadrant is where most shipped changes actually live, and it is invisible because the only instrument pointed at it is the green bar, which is pointed at the wrong axis.
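The 2×2 can be written down as a small lookup, which makes the point that it takes two inputs and not one. This is a sketch only; the Effect enum and the quadrant function are names invented here, and the labels are the ones from the grid.

```python
from enum import Enum

class Effect(Enum):
    CONFIRMED = "confirmed"
    REFUTED = "refuted"
    UNMEASURED = "unmeasured"

def quadrant(spec_passes: bool, effect: Effect) -> str:
    """Place a change on the 2x2 from its two independent readings."""
    if not spec_passes:
        # Red blocks the merge whichever way the effect would have gone.
        return "blocked (correctly)"
    if effect is Effect.CONFIRMED:
        return "shipped and works"
    # Refuted and unmeasured land in the same cell.
    return "green but no better"

print(quadrant(True, Effect.UNMEASURED))
```

A pipeline that only knows spec_passes cannot tell the two green cells apart.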

This is the reusable mental model: stop asking "did it pass" as if it were one question. It is two questions on two axes, and you are only instrumenting one of them.

A worked example: green architecture tests

Make it concrete with a case every Python team recognizes. You adopt the hexagonal layout - Ports and Adapters. A pure core (contract/, dto/, policy/) imports only the standard library. Adapters wrap the I/O. Subsystems compose the adapters. The dependency rule is one sentence: dependencies point inward, and the pure core points at nothing.

You enforce it with an architecture test that runs on every commit:

def test_core_purity():
    for module in pure_core_modules():
        assert no_imports(module, {"yaml", "requests", "subprocess", "open"})

It is green. Every pure module is import-clean. The dependency graph conforms to the rule - proven correct, mechanically, on every push.
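For readers who want to see what such a check might look like underneath, here is one possible sketch built on the standard library's ast module. It is an assumption about how a helper like no_imports could work, not the author's implementation: this version takes source text instead of a module object, and it covers imports only. The article's set also lists "open", which is a builtin call and not an import, so it is left out here.

```python
import ast

# Forbidden top-level packages for the pure core (imports only; see note above).
FORBIDDEN = {"yaml", "requests", "subprocess"}

def imported_roots(source: str) -> set:
    """Return the top-level package names that a module's source imports."""
    roots = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import dto".
            roots.add(node.module.split(".")[0])
    return roots

def no_imports(source: str, forbidden: set) -> bool:
    """True when the source imports none of the forbidden packages."""
    return imported_roots(source).isdisjoint(forbidden)

# Hypothetical module sources; parsing them does not import anything.
clean_module = "from dataclasses import dataclass\nimport enum\n"
leaky_module = "import json\nfrom requests.adapters import HTTPAdapter\n"

print(no_imports(clean_module, FORBIDDEN), no_imports(leaky_module, FORBIDDEN))
```

Because the check parses source without executing it, it can run on every commit without the forbidden packages being installed.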

Now ask the other question. You did not adopt the hexagonal layout for its own sake. You adopted it on a claim: changes would get cheaper, blast radius would shrink, the core would be testable without a single mock. Did that happen? test_core_purity cannot tell you. It measures the shape of the import graph, not whether the shape paid off. Green here means conformant. A conformant architecture that made nothing cheaper is the green-but-no-better quadrant with a .py file pointing straight at it.

That is the whole thesis in one test file. The architecture test lives on the spec axis. The promise that sold the refactor lives on the effect axis. They do not touch, and only one of them has an instrument.

The expectation primitive

If "better" is its own axis, it needs its own instrument. That instrument is an expectation - the HDD flow above, made into a first-class artifact.

An expectation is a change-scoped, falsifiable, deadline-carrying prediction. It is the HDD triple - believe X, expect outcome Y, know it by criterion Z - turned into an artifact that travels with the change instead of living in a planning doc nobody reopens. It carries four things: a baseline, a bound measurement view (the query, tied to that change, that reads the metric), a threshold, and a due date.

Concretely: the caching change should cut p99 latency below 200ms, measured against last week's baseline, re-checked Friday. That sentence is the whole artifact, and it rides attached to the diff.
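As a data structure, that sentence might look like the sketch below. The field names, the change id, the baseline value, the date, and the stand-in measurement function are all hypothetical; the only parts taken from the text are the shape (baseline, bound view, threshold, due date) and the 200ms threshold.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

@dataclass(frozen=True)
class Expectation:
    change_id: str                          # the diff this prediction rides with
    belief: str                             # the predicted effect, in words
    baseline: float                         # metric value before the change
    threshold: float                        # value the metric must beat
    measure: Callable[[], Optional[float]]  # bound measurement view
    due: date                               # when the verdict gets stamped

    def is_due(self, today: date) -> bool:
        return today >= self.due

def read_p99_ms() -> Optional[float]:
    # Stand-in for a real query against a metrics store (assumed, not shown).
    return 185.0

caching = Expectation(
    change_id="cache-layer",                # hypothetical identifier
    belief="caching cuts p99 latency below 200ms",
    baseline=240.0,                         # invented baseline for illustration
    threshold=200.0,
    measure=read_p99_ms,
    due=date(2025, 1, 10),                  # an arbitrary Friday
)

print(caching.is_due(date(2025, 1, 10)), caching.measure())
```

The measure field is the piece that separates this from a planning note: the prediction carries the means of checking itself.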

Here is the part that is genuinely new, and the part I want to be precise about: the verdict is measured by the system. HDD and PMI both already told you to make a falsifiable prediction with an owner and a deadline. That is not the advance. The advance is that the prediction binds to a real measurement view, and a system reads that view and stamps the verdict itself. The human who made the prediction does not come back and grade it. The human leaves the verification loop.

This is not a dashboard alert. A dashboard alert fires on a metric crossing a line, untethered to any one change. An expectation binds one prediction to one change and grades that prediction - the dashboard tells you latency rose; the expectation tells you the change you predicted would cut it did the opposite.

The verdict vocabulary is three words, and the discipline is in the third:

  • confirmed - the measured value met the threshold. Auto-resolved, silently. Announcing it would be filler.
  • refuted - the measured value missed the threshold. The expectation stays open and surfaces, with the predicted delta and the actual sitting next to each other.
  • inconclusive - the measurement produced no scalar yet. It stays open and surfaces. It is never read as confirmed. "We didn't measure" is not "it worked."

That third word is the one most dashboards quietly drop. A prediction you cannot yet grade is not a pass. Keeping inconclusive distinct from confirmed is what stops the green-but-no-better quadrant from refilling under a different name.
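The three-word vocabulary reduces to a small grading function. The sketch below assumes a lower-is-better metric by default and treats a missing or NaN reading as "no scalar"; the names Verdict and grade are invented here.

```python
import math
from enum import Enum
from typing import Optional

class Verdict(Enum):
    CONFIRMED = "confirmed"
    REFUTED = "refuted"
    INCONCLUSIVE = "inconclusive"

def grade(measured: Optional[float], threshold: float,
          lower_is_better: bool = True) -> Verdict:
    """Stamp a verdict; a missing measurement is never read as confirmed."""
    if measured is None or math.isnan(measured):
        return Verdict.INCONCLUSIVE
    met = measured < threshold if lower_is_better else measured > threshold
    return Verdict.CONFIRMED if met else Verdict.REFUTED

print(grade(185.0, 200.0), grade(230.0, 200.0), grade(None, 200.0))
```

The first branch is the whole discipline: the function has no path from "nothing measured" to confirmed.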

The selector

The selector routes by the change's effect, not its shape. A one-line diff can carry a large measurable effect and owe an expectation; a thousand-line refactor can be pure conformance and owe none - and you pick the lane when you ship, not after.

Run the selector over that same hexagonal project. Re-homing a module to clear a test_subsystem_isolation failure is a shape change. It restores the boundary, its effect resists measurement, and it owes nothing past the green test. The refactor that promised cheaper changes is the other lane - it made a measurable claim, so it owes an expectation: adding the next artifact class touches three files or fewer, re-checked when the next one lands. Same codebase, two changes, two lanes.

This is what the spec-versus-hypothesis fight always missed. The selector does not pick a winner between SDD and HDD. It is conditional. It runs SDD's discipline on changes whose effect resists measurement and HDD's discipline on changes whose effect can be measured, and it picks per change. A discipline that reconciles two camps looks like a routing rule that knows when each camp is right.

There is one trap to name: do not attach an expectation to an unmeasurable change just to satisfy the selector. A prediction with no real threshold is conformance wearing a costume, and it teaches everyone to ignore the verdicts. If the effect resists measurement, declare none. The honesty of the refuted and inconclusive verdicts depends on the selector being willing to say "this change owes nothing."

This is the part that has to survive contact with a real backlog. The tempting first cut is to flag every shipped change as owing a prediction. Try it and it floods on contact: the renames, the prose-clarity passes, the reorganizations - the bulk of any change history - all light up red against a rule they cannot satisfy, because there was never a metric for them to move. The flood is the proof. A selector that cannot say "this change owes nothing" does not enforce the discipline; it discredits it. The conditional design is not a softening of the rule - it is the only version of the rule that does not collapse the first time you run it over changes that already shipped.
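A routing rule of this kind can be very small. The following is a sketch under assumptions: the Change fields and lane names are invented, and "measurable" is reduced to "declares both a metric and a threshold". Diff size is recorded but deliberately ignored.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Change:
    summary: str
    lines_changed: int
    claimed_metric: Optional[str] = None  # None: the effect resists measurement
    threshold: Optional[str] = None       # None: no falsifiable bound declared

def select_lane(change: Change) -> str:
    """Route by effect, not shape: lines_changed plays no part."""
    if change.claimed_metric and change.threshold:
        return "expectation"
    # A claim with no real threshold falls back to conformance as well.
    return "conformance"

changes = [
    Change("re-home module to fix test_subsystem_isolation", 1000),
    Change("hexagonal refactor", 1000,
           "files touched to add an artifact class", "<= 3"),
    Change("tune cache TTL", 1, "p99 latency", "< 200ms"),
    Change("rename helper", 12, "readability"),  # metric named, no threshold
]

for change in changes:
    print(select_lane(change), "-", change.summary)
```

Run over a history full of renames and reorganizations, most changes come back as conformance, which is the behaviour that keeps the rule from flooding.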

Proof of existence, and the cost of skipping it

A coordination system shipped exactly this layer, in working code. A change to a governed surface declares an expectation: predicted delta, bound measurement view, due date. The system re-measures when the due date passes, stamps confirmed / refuted / inconclusive against the threshold, auto-resolves the confirmed ones silently, and surfaces the refuted and inconclusive ones in the next session, one line each. The selector decides which changes owe a prediction and which fall back to conformance.
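The re-measure cycle described above could be sketched roughly as follows. This is not that system's code; the class, the function, the change ids, and the readings are hypothetical, a lower-is-better metric is assumed, and "the due date passes" is read as today being on or after the due date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

@dataclass
class OpenExpectation:
    change_id: str
    threshold: float
    due: date
    measure: Callable[[], Optional[float]]

def review(expectations, today):
    """Re-measure due expectations; return (still_open, surfaced_lines)."""
    still_open, surfaced = [], []
    for exp in expectations:
        if today < exp.due:
            still_open.append(exp)  # not due yet: keep it, say nothing
            continue
        value = exp.measure()
        if value is None:
            verdict = "inconclusive"
        elif value < exp.threshold:
            continue                # confirmed: auto-resolved silently
        else:
            verdict = "refuted"
        still_open.append(exp)
        surfaced.append(
            f"{exp.change_id}: {verdict} "
            f"(threshold {exp.threshold}, measured {value})"
        )
    return still_open, surfaced

friday = date(2025, 1, 10)
backlog = [
    OpenExpectation("cache-p99", 200.0, friday, lambda: 185.0),
    OpenExpectation("pool-resize", 200.0, friday, lambda: 230.0),
    OpenExpectation("new-index", 200.0, friday, lambda: None),
    OpenExpectation("later", 200.0, date(2025, 1, 17), lambda: 100.0),
]
still_open, surfaced = review(backlog, friday)
print("\n".join(surfaced))
```

Only the refuted and inconclusive entries produce a line; the confirmed one drops out of the open set without output.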

The point here is only that the pairing - a governance selector plus an auto-measured verdict - is buildable today, because it has been built. The mechanism is the proof; the product is not the subject.

confirmed is the minority verdict

What running it actually teaches is the part the diagram cannot. Before the verdicts come back, you assume most predictions will land - you shipped the change believing it would help, so of course the measurement will agree. It does not.

Once an instrument is actually pointed at the effect axis, confirmed turns out to be the minority verdict. Most changes come back inconclusive or refuted: the metric did not move past the threshold, or there was no scalar to read in the window at all. The first time a batch of verdicts surfaces and almost none of them are confirmed, the 2×2 stops being a diagram. The green-but-no-better quadrant is not a rare corner you occasionally fall into - it is where most changes sit, and you only learn that the moment you can finally see it.

inconclusive is the common case

Inconclusive is the verdict you underestimate most. "We didn't measure" happens far more often than any prediction would lead you to expect - the window was too short, the metric needed traffic that had not arrived, the effect was real but not yet legible. The entire value of the third word is that it refuses to round itself up. Without it, every one of those un-graded predictions quietly refiles as a pass, and the quadrant refills under a name that looks like success. The discipline lives in holding inconclusive open, in plain sight, until there is a scalar - and discovering, run after run, how often that takes longer than you thought.

Even confirmed is not permanent

Even the confirmed ones are not closed for good. A threshold set too loose can confirm a change that later regresses, and a silent auto-resolve would bury exactly that. A confirmed verdict earns its silence; it does not earn permanent trust.

One discipline already exists for this. SkillOpt (Yang et al., 2026) accepts a self-authored skill edit only when it improves a held-out validation split - not the data the edit was tuned against. The held-out split is the guard against exactly that failure: a verdict that reads confirmed because the threshold was set on the same signal the change was shaped to move. An auto-measured verdict needs the same guard: a measurement view that the change's author cannot also tune.
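The held-out idea can be shown with a toy. The sketch below is not SkillOpt's implementation; it only illustrates the acceptance rule as the paragraph describes it, with an invented task (doubling a number), invented candidate edits, and an invented scoring function.

```python
from statistics import mean

def score(edit, example):
    # Toy task: a correct edit doubles its input.
    return 1.0 if edit(example) == 2 * example else 0.0

def accept_edit(baseline, candidate, examples):
    """Accept the candidate only if it beats the baseline on these examples."""
    base = mean(score(baseline, ex) for ex in examples)
    cand = mean(score(candidate, ex) for ex in examples)
    return cand > base

tuning_split = [1, 2]    # data the edit was shaped against
held_out_split = [3, 4]  # data the edit's author could not tune on

baseline = lambda x: x + 1                # right only for x == 1
overfit = lambda x: {1: 2, 2: 4}.get(x, 0)  # memorized the tuning split
genuine = lambda x: 2 * x

print(accept_edit(baseline, overfit, tuning_split),
      accept_edit(baseline, overfit, held_out_split),
      accept_edit(baseline, genuine, held_out_split))
```

Graded on the tuning split, the memorizing edit reads as an improvement; graded on the held-out split, it does not. A measurement view that the author can tune plays the role of the tuning split.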
