My agent dry-ran fine in staging 100 times - then wrecked production on the first real run
A staging-to-production data bleed cost me 4 hours of rollback. That's what finally made dry-run a structural requirement, not an afterthought.
The common advice is: test in staging, promote when green. The problem is environment drift. My D1 schema changes once or twice a week, and a solo operator can't keep staging perfectly synchronized. Worse, agents don't have fixed execution paths - the same input can produce a different tool call sequence on the next run. I ran a flow 100 times in staging and still hit a fresh path on the first production execution.
The Real Problem: Mock Responses
The most surprising thing I learned after 6 months of running this: latency wasn't the problem I expected. KV writes averaged 12ms - basically imperceptible. The real problem was that mock responses fool the agent into treating skipped writes as real successes.
I'd dry-run an R2 put, the agent would believe the file was uploaded, and then proceed to write metadata to D1 - which was not in dry-run scope. Real write, orphaned record.
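A minimal sketch of how that orphan arises, using in-memory stand-ins for R2 and D1. Every name here (`dryRunScope`, `r2Put`, `d1InsertMetadata`, `agentUploadFlow`) is hypothetical and chosen for illustration; none of it is the real Cloudflare API or my actual hook code.

```typescript
// Hypothetical in-memory stand-ins for R2 and D1; all names are illustrative.
type ToolResult = { ok: boolean; dryRun: boolean };

const r2Objects = new Map<string, string>();
const d1Rows: { key: string; size: number }[] = [];

// Tools listed here are intercepted; anything else writes for real.
const dryRunScope = new Set<string>(["r2.put"]);

function r2Put(key: string, body: string): ToolResult {
  if (dryRunScope.has("r2.put")) {
    // Mock response: looks like success, but nothing was stored.
    return { ok: true, dryRun: true };
  }
  r2Objects.set(key, body);
  return { ok: true, dryRun: false };
}

function d1InsertMetadata(key: string, size: number): ToolResult {
  if (dryRunScope.has("d1.insert")) {
    return { ok: true, dryRun: true };
  }
  d1Rows.push({ key, size });
  return { ok: true, dryRun: false };
}

// The agent only checks `ok`, so it treats the skipped upload as done
// and moves on to the metadata write, which is outside dry-run scope.
function agentUploadFlow(key: string, body: string): void {
  const upload = r2Put(key, body);
  if (upload.ok) {
    d1InsertMetadata(key, body.length);
  }
}

agentUploadFlow("report.csv", "a,b,c");

// Metadata rows whose object was never actually stored.
const orphans = d1Rows.filter((row) => !r2Objects.has(row.key));
```

After the flow runs, `orphans` holds the metadata row for `report.csv` while `r2Objects` is still empty: a real write pointing at a file that was never uploaded.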
The Fix: Propagate Dry-Run Flag
Once any write tool in a run is intercepted as a dry-run, set a flag for that runId that forces every subsequent write in the same run into dry-run as well.
// after intercepting first dry-run write
await ctx.env.KV.put(`dryrun_active:${ctx.runId}`, "1", {
  expirationTtl: 3600,
});
// every subsequent hook checks this flag
const isDryRunActive = (await ctx.env.KV.get(`dryrun_active:${ctx.runId}`)) === "1";
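Put together, the two snippets above might look like the following sketch. The `KVLike` interface is modeled only on the `get`/`put` calls already shown, `MemoryKV` is a hypothetical in-memory stand-in for local runs, and `decideWrite` with its `toolIsDryRun` parameter is an assumed shape for the hook rather than the exact code I run.

```typescript
// Minimal KV shape, modeled on the get/put calls used above.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Hypothetical in-memory KV for local runs; it ignores expirationTtl.
class MemoryKV implements KVLike {
  private store = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    return this.store.get(key) ?? null;
  }

  async put(key: string, value: string): Promise<void> {
    this.store.set(key, value);
  }
}

type Decision = "execute" | "dry-run";

// Hypothetical pre-write hook. `toolIsDryRun` says whether this specific
// tool is configured for dry-run; the flag covers every later write.
async function decideWrite(
  kv: KVLike,
  runId: string,
  toolIsDryRun: boolean,
): Promise<Decision> {
  const flagKey = `dryrun_active:${runId}`;

  if (toolIsDryRun) {
    // First intercepted write: mark the whole run as dry-run for an hour.
    await kv.put(flagKey, "1", { expirationTtl: 3600 });
    return "dry-run";
  }

  // Any other write tool: follow the run-level flag if it is set.
  const active = (await kv.get(flagKey)) === "1";
  return active ? "dry-run" : "execute";
}
```

With this shape, a write that would normally execute is downgraded to dry-run as soon as an earlier write in the same run was intercepted, while other runs are unaffected. One caveat: Workers KV is eventually consistent, so a flag written in one location may not be immediately visible from others.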
Hook Failure Edge Case
One more thing that burned me: if the hook itself fails - say, KV becomes temporarily unavailable - Claude Code's default behavior is to fall through. The tool call executes anyway and the dry-run flag is ignored. Last week a KV latency spike caused hook timeouts, and 3 agents wrote directly to production. There was no data loss because those operations were idempotent, but that was luck. Hook failure needs its own alert, separate from agent failure.
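One way to avoid relying on luck is to make the hook fail closed and raise its own alert. The sketch below is an assumption about how that could look, not a description of Claude Code's hook API: `guardedHook`, `alertHookFailure`, the `"block"` decision, and the 500 ms default timeout are all hypothetical, and mapping `"block"` onto whatever your hook runner treats as a blocking result is left to the caller.

```typescript
type HookDecision = "execute" | "dry-run" | "block";

// Hypothetical alert sink, kept separate from agent-failure reporting.
type AlertFn = (message: string) => void;

// Runs the dry-run check with a timeout. Any error or timeout raises a
// hook-failure alert and returns "block" instead of falling through.
async function guardedHook(
  check: () => Promise<"execute" | "dry-run">,
  alertHookFailure: AlertFn,
  timeoutMs = 500,
): Promise<HookDecision> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`hook timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });

  try {
    // Whichever settles first wins: the real check or the timeout.
    return await Promise.race([check(), timeout]);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    alertHookFailure(`dry-run hook failed: ${reason}`);
    return "block"; // fail closed: no write without a verdict
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

The trade-off is availability: a KV outage now stalls writes instead of letting them through, which is the behavior I'd rather have for production writes.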
I wrote up the full breakdown - including the dry-run propagation edge cases, R2 + D1 orphan scenarios, and where this pattern completely falls apart (read-modify-write loops, APIs with side-effectful reads) - over on riversealab.com.