
Your Agent's Retries Are Double-Charging Your Users (and Every Eval Is Green)

Your agent calls a tool. The tool times out at the network layer but actually succeeds on the server. Your harness sees no response, so it retries. Now charge_customer ran twice, send_email fired twice, and create_ticket left two tickets. The model did nothing wrong. Every eval you have is green. And a customer just got billed $198 for a $99 plan.

This is the failure mode nobody puts in a demo, because demos don't retry and demos don't have side effects that matter. Production has both. If your agent takes actions - not just generates text - then retry safety is not a nice-to-have, it is the difference between an autonomous system and a liability with a scheduler.

I want to argue two things. First: side-effect safety is a Tier 1 evaluation problem, not a prompt problem. Second: you cannot even see this class of bug without a trace of what the agent actually did, which is where the eval story and the observability story become the same story.

Why the model can't save you here

The instinct is to make the agent smarter. "Tell it to check whether the charge already went through before retrying." Please don't. You are asking a non-deterministic component to enforce an invariant that must hold every single time. The model will comply 95% of the time and the other 5% is a chargeback.

Retries don't come from the model anyway. They come from your harness, your HTTP client, your queue's at-least-once delivery, a Kubernetes pod restart mid-execution. The agent's "reasoning" is nowhere near the retry. So no amount of judging the agent's output tells you whether the effect happened once or twice.
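To make that concrete, here is a minimal simulation of the failure, not code from any real harness. The fake server, the flaky transport, and the retry wrapper are all hypothetical names invented for illustration. The server commits the charge, the response is lost, and a naive retry loop bills the customer again without the model being involved at any point.

```typescript
// Hypothetical in-memory "payment server": every accepted request is a real charge.
const serverCharges: number[] = [];

// Transport whose first call loses the response AFTER the server has committed.
let callCount = 0;
function chargeOverFlakyNetwork(amountCents: number): string {
  callCount += 1;
  serverCharges.push(amountCents); // the server-side effect always happens
  if (callCount === 1) {
    throw new Error("ETIMEDOUT"); // ...but the caller never hears about it
  }
  return "ok";
}

// Naive harness retry: treats "no response" as "nothing happened".
function withNaiveRetry<T>(fn: () => T, maxAttempts = 3): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err; // swallow and try again
    }
  }
  throw lastError;
}

const outcome = withNaiveRetry(() => chargeOverFlakyNetwork(9900));
const totalBilledCents = serverCharges.reduce((sum, c) => sum + c, 0);
console.log(outcome, serverCharges.length, totalBilledCents); // ok 2 19800
```

The caller sees a clean "ok". The only place the second charge is visible is the server's own record, which is the point of everything that follows.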

This is exactly why I think about evidence on an independence axis, not a cost axis. Evidence is only worth what the agent couldn't forge:

  • Tier 1 - proof the agent can't fake. The side effect is externally observable. Did exactly one charge with this idempotency key hit Stripe? Does exactly one ticket exist? This is a deterministic yes/no you read from the world, not from the agent.
  • Tier 2 - statistical signal vs a baseline the agent didn't author. Retry-rate per tool trending up, duplicate-detection hits, latency distributions shifting. Signal, cheap, real-time.
  • Tier 3 - model-as-judge. Useful for "was this refund reasonable?" Useless for "did it happen twice." A judge is a shared-substrate opinion: a signal, never a verdict, and never allowed in the hot path.

Double-execution lives entirely in Tier 1. It's binary, it's forgery-proof, and it's the 80% of production incidents that never needed an LLM to catch.
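As a rough sketch of what a Tier 1 check reduces to: given the keys the harness intended to execute and the effects read back from the external system, count how many effects landed per key. The `ObservedEffect` and `Tier1Verdict` types and the `assertExactlyOnce` function are hypothetical names for illustration, not part of any library.

```typescript
// Hypothetical record of an effect as read back from the external system
// (a provider's API, your own database), never from the agent's output.
type ObservedEffect = { idempotencyKey: string; tool: string };

type Tier1Verdict = { pass: boolean; violations: Record<string, number> };

// For each key the harness intended to execute, count the effects that landed.
// Any count other than 1 (0 = lost, 2+ = duplicated) is a violation.
function assertExactlyOnce(
  expectedKeys: string[],
  observed: ObservedEffect[]
): Tier1Verdict {
  const counts = new Map<string, number>();
  for (const key of expectedKeys) counts.set(key, 0);
  for (const effect of observed) {
    const current = counts.get(effect.idempotencyKey);
    if (current !== undefined) counts.set(effect.idempotencyKey, current + 1);
  }
  const violations: Record<string, number> = {};
  counts.forEach((n, key) => {
    if (n !== 1) violations[key] = n;
  });
  return { pass: Object.keys(violations).length === 0, violations };
}

const verdict = assertExactlyOnce(
  ["k1", "k2"],
  [
    { idempotencyKey: "k1", tool: "charge_customer" },
    { idempotencyKey: "k1", tool: "charge_customer" }, // the duplicate
    { idempotencyKey: "k2", tool: "create_ticket" },
  ]
);
console.log(verdict); // { pass: false, violations: { k1: 2 } }
```

No model is consulted anywhere in this function, which is what makes the verdict independent of the agent.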

The actual fix: idempotency keys the agent doesn't control

The correct architecture makes double-execution impossible, then evaluates that the invariant held. The key insight: the idempotency key is derived by the harness from the intent, not minted by the model on each attempt.

import { createHash } from "node:crypto";

type ToolCall = {
  tool: string;
  args: Record<string, unknown>;
  runId: string;
};

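// Serialize with object keys sorted, so argument order never changes the hash.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonicalize(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// The key is a pure function of intent - identical across retries,
// because the AGENT never generates it. The harness does.
// Assumes a run never repeats an identical call on purpose; if it can,
// add a step index to the hashed object.
function idempotencyKey(call: ToolCall): string {
  const canonical = canonicalize({
    tool: call.tool,
    args: call.args,
    runId: call.runId, // one logical action per run, not per attempt
  });
  return createHash("sha256").update(canonical).digest("hex");
}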

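// Any durable key-value store shared by every worker (Postgres, Redis, ...).
type LedgerEntry = { status: "in_flight" | "committed"; result?: unknown };
declare const ledger: {
  get(key: string): Promise<LedgerEntry | undefined>;
  put(key: string, entry: LedgerEntry): Promise<void>;
};

async function executeOnce(
  call: ToolCall,
  sideEffect: (key: string) => Promise<unknown>
) {
  const key = idempotencyKey(call);
  // Tier 1 proof, checked BEFORE the effect: has this exact intent run?
  const prior = await ledger.get(key);
  if (prior?.status === "committed") {
    return { key, result: prior.result, replayed: true }; // no second charge
  }
  // A prior "in_flight" entry means an earlier attempt may or may not have landed.
  // Re-sending with the SAME key lets the provider deduplicate it.
  await ledger.put(key, { status: "in_flight" });
  const result = await sideEffect(key); // pass key downstream to Stripe et al.
  await ledger.put(key, { status: "committed", result });
  return { key, result, replayed: false };
}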

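Now the retry is safe by construction: a second attempt with the same intent replays the recorded result instead of firing the effect again. If the first attempt died before it could commit, the retry re-sends the same key downstream, so a provider that honors idempotency keys (Stripe does) deduplicates it on its side. But - and this is the part people skip - being safe is not the same as knowing you're safe. You still have to prove it in your evals.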
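One gap worth illustrating: a read followed by a separate write is not atomic, so two attempts that arrive together can both see "nothing committed" and both fire. The sketch below closes that window with a single claim step. `InMemoryLedger` and `executeOnceAtomic` are hypothetical names, and the in-memory map stands in for what would have to be one atomic operation in a shared store (for example an insert guarded by a unique constraint).

```typescript
type Entry =
  | { status: "in_flight" }
  | { status: "committed"; result: unknown };

// Hypothetical in-memory ledger, for illustration only.
class InMemoryLedger {
  private entries = new Map<string, Entry>();

  // Returns true only for the first caller to claim the key.
  claim(key: string): boolean {
    if (this.entries.has(key)) return false;
    this.entries.set(key, { status: "in_flight" });
    return true;
  }

  commit(key: string, result: unknown): void {
    this.entries.set(key, { status: "committed", result });
  }

  get(key: string): Entry | undefined {
    return this.entries.get(key);
  }
}

type Outcome =
  | { state: "executed"; result: unknown }
  | { state: "replayed"; result: unknown }
  | { state: "pending" }; // another attempt holds the claim; back off and re-check

async function executeOnceAtomic(
  ledger: InMemoryLedger,
  key: string,
  sideEffect: (key: string) => Promise<unknown>
): Promise<Outcome> {
  if (!ledger.claim(key)) {
    const prior = ledger.get(key);
    return prior?.status === "committed"
      ? { state: "replayed", result: prior.result }
      : { state: "pending" };
  }
  const result = await sideEffect(key); // only the claim holder reaches this line
  ledger.commit(key, result);
  return { state: "executed", result };
}
```

This sketch does not handle a claim holder that crashes between the claim and the commit. The key would stay "in_flight" until something reconciles it, for instance by re-sending with the same key and relying on the provider to deduplicate.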

The eval and the trace are one system

Here's where I'll stop describing generic hygiene and tell you what I actually run: agent-eval to score and gate the output, and AgentLens to capture the trace it scores against. They ship as a unit for a reason I only appreciated after getting burned.

agent-eval owns the Tier 1 gate. After every run it asserts the invariant against ground truth the agent could not author:

import { evaluate } from "agent-eval";

const report = await evaluate(run, {
  checks: [
    // Tier 1: externally observable, unforgeable proof.
    {
      id: "single-charge",
      tier: 1,
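      // Assumes the harness wrote the key into the charge's metadata.key.
      run: async ({ trace }) => {
        const key = trace.toolCalls.find(
          (t) => t.tool === "charge_customer"
        )?.idempotencyKey;
        if (!key) return { pass: false, detail: "no charge_customer call in trace" };
        // charges.list cannot filter by metadata; the Search API can.
        // Search is eventually consistent, so run this shortly after the run, not instantly.
        const hits = await stripe.charges.search({
          query: `metadata['key']:'${key}'`,
        });
        return {
          pass: hits.data.length === 1,
          detail: `charges=${hits.data.length}`,
        };
      },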
    },
    // Tier 2: statistical signal vs a baseline the agent didn't set.
    {
      id: "retry-rate",
      tier: 2,
      run: ({ trace }) => {
        const retries = trace.toolCalls.filter((t) => t.replayed).length;
        return {
          pass: retries <= baseline.p95,
          detail: `replays=${retries}`,
        };
      },
    },
  ],
});

Notice trace.toolCalls and t.idempotencyKey. Where do they come from? AgentLens. It records every model step and every tool step - resolved inputs, the idempotency key the harness derived, the raw provider response, whether the attempt was a replay or a fresh effect. Without that trace, the "single-charge" check has nothing to read.

The agent's own summary ("I charged the customer once") is exactly the self-report you must not trust - it's shared-substrate, the agent authored it. That's the whole thesis of pairing them. Tier 1+2 only mean something if they run over data the agent didn't get to write. AgentLens produces the unforgeable substrate; agent-eval renders the verdict. One captures how the agent got there, the other decides whether "there" was correct - and critically, Tier 1+2 run over trajectories in real time at roughly $0, while the judge stays offline for the subjective tail where it belongs.
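The retry-rate check above compares against baseline.p95 without saying where that number comes from. One plausible way to derive it, sketched here under assumptions, is a nearest-rank percentile over replay counts from past runs. The `percentile` helper and the sample data are invented for illustration; the important property is that the baseline comes from historical traces, not from anything the current run wrote.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no baseline samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical replay counts from 20 earlier runs.
const historicalReplaysPerRun = [
  0, 0, 0, 1, 0, 0, 2, 0, 1, 0,
  0, 0, 1, 0, 0, 0, 0, 3, 0, 0,
];

const baseline = { p95: percentile(historicalReplaysPerRun, 95) };

// Tier 2 signal: flag runs whose replay count exceeds the historical p95.
function retryRateCheck(replays: number): { pass: boolean; detail: string } {
  return { pass: replays <= baseline.p95, detail: `replays=${replays}` };
}

console.log(baseline.p95); // 2
console.log(retryRateCheck(3)); // { pass: false, detail: 'replays=3' }
```

A failing Tier 2 check is a signal to look at the trace, not proof of a double charge; the proof still comes from the Tier 1 check.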

Ship the 80%

You do not need a smarter model to stop double-charging customers. You need:

  • Idempotency keys the model never touches, so retries are safe by construction.
  • A Tier 1 check that reads the world and asserts exactly-once - the deterministic, real-time gate.
  • A trace (AgentLens) unforgeable enough that the check has real ground truth to read.

Reserve the model-as-judge for the genuinely subjective 20% - "was this refund fair?" - and label its output opinion, not evidence. The retry storm quietly draining your customers' cards is not in that 20%. It never was. It's the most catchable bug you have, sitting in Tier 1, waiting for you to look at the trace.
