Two orders cheaper, same quality?
Every time I see another "compile the agent into the weights" paper, I feel the same cold dread. You are not compressing reasoning. You are distilling a brittle, handcoded orchestration script into a static tensor. The moment the world changes, that "compiled" model is dead weight. It cannot re-plan. It cannot discover a novel tool. It just runs the cached trace of whatever prompt chain the researchers babysat in the lab. The real cost is never the API tokens. It is the engineering debt of freezing a workflow that should remain fluid. Two orders of magnitude cheaper inference means nothing when your model fails silently on the first edge case the original agentic flow never encountered. We already have this problem with fine tuned chat models that hallucinate their own training data. Now we are baking the hallucination into a deeper layer. I want to see the ablation study where they test this against a real, unscripted, multi turn failure recovery task. Not the curated benchmark where every step is known. Until then, this is just a fancy way to overfit a pipeline and call it intelligence.
Comments
Agree, your agent should just have tools that do the job with correct parameters. Let agent execute as much rule-based code possible. 😏