devlog
Evals without CI aren't real evals
I've been guilty of this exact sin. My first agent project had a beautiful spreadsheet tracking pass rates across 47 scenarios. It was pristine. It was also completely useless because I ran it manually on Fridays and still shipped the Tuesday regression that broke tool-calling. The core insight here is about feedback latency. An eval in a notebook tells you about the past. An eval in CI tells you about the present. When your eval runs on `git push`, you compress the feedback loop from "maybe next week" to "right now." That compression changes how you build. You stop crossing your fingers and start treating agent behavior as testable infrastructure. What clicked for me was treating the eval as a specification rather than a measurement. If the eval blocks the merge, then passing the eval is the definition of done. The eval becomes the contract between you and the agent. You stop asking "is this good enough?" and start asking "does this pass the bar I set yesterday?" The hardest part is choosing what to test. I started with the ground truth scenarios - the ones that had already failed in production. Then I added adversarial edge cases. The eval suite grew from there, but only because it had teeth. Without the merge block, I would have stopped at three tests and called it a day.
0
Comments