DEV Community 2h ago

Why teaching AI agents to use tools keeps blowing up in training

Multi-step reinforcement learning for tool-using AI agents tends to collapse mid-training. According to a new paper, "Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It", the cause is not that the model loses its skills: the probabilities of a few structural control tokens spike and scramble the agent's execution scaffolding. The proposed fix is to interleave supervised learning with the reinforcement training, which keeps those control tokens in check.

Key Facts

What: A new paper pins the sudden collapse of multi-step tool-use training on runaway probabilities in a few control tokens, and shows that mixing in supervised examples stabilizes it.
When: 2026-06-26
Primary source: arXiv 2606.26027

The Problem

Reinforcement learning polishes a model after its initial training: the agent tries things, gets rewarded for good outcomes, and adjusts. For a tool-using agent, a single task can require a dozen sequential actions - call a search tool, read the result, call a calculator, format an answer - with reward only arriving at the end. That long chain is exactly what makes the training fragile: a small problem early on cascades through every subsequent step.
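To see how quickly a long chain magnifies small errors, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification for illustration and not a model taken from the paper; the function name and the numbers are hypothetical.

```python
def chain_success(per_step_success: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and equally reliable, which is an
    illustrative simplification, not a claim from the paper.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


if __name__ == "__main__":
    # A dozen sequential tool calls, as in the example above.
    for p in (0.99, 0.95, 0.90):
        print(f"per-step {p:.2f} -> whole task {chain_success(p, 12):.3f}")
```

Under these assumptions, an agent that gets each step right 95% of the time finishes a twelve-step task only a little over half the time, and because reward arrives only at the end, every failed chain looks the same to the learner regardless of which step went wrong.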

The authors' finding is precise. The collapse is not skill loss; the underlying capability stays intact. Instead, the training causes unexpected probability spikes in a few specific control tokens - the small structural markers that tell the system when to start a tool call or when to stop. When one of these control tokens balloons out of proportion, it scrambles the agent's structured execution. The capability remains; the scaffolding that organizes it breaks.

Think of a skilled chef who, under stress, develops a compulsive tic of shouting "next, next, next" out of turn. The cooking knowledge is untouched, but the kitchen's choreography collapses because the timing commands that coordinate the line have gone haywire. The dishes come out wrong not because the chef forgot how to cook, but because the control signals got corrupted. That is what runaway control-token probabilities do to an agent.
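One way to watch for that kind of runaway is to track how much probability the model puts on control tokens over the course of training. The sketch below is a minimal illustration, not the authors' diagnostic: the token names, the spike ratio, and the warm-up length are all assumptions, since the article does not say which tokens or thresholds the paper uses.

```python
import math

# Hypothetical control tokens; the article does not name the actual ones.
CONTROL_TOKENS = ("<tool_call>", "</tool_call>")


def softmax(logits):
    """Convert a {token: logit} mapping into probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def control_token_mass(logits):
    """Total probability assigned to control tokens at one position."""
    probs = softmax(logits)
    return sum(probs.get(tok, 0.0) for tok in CONTROL_TOKENS)


def find_spikes(mass_per_step, ratio=3.0, warmup=3):
    """Return the steps whose control-token mass exceeds `ratio` times
    the mean of all earlier steps. `ratio` and `warmup` are arbitrary
    illustrative defaults."""
    spikes = []
    for step in range(warmup, len(mass_per_step)):
        baseline = sum(mass_per_step[:step]) / step
        if mass_per_step[step] > ratio * baseline:
            spikes.append(step)
    return spikes


if __name__ == "__main__":
    history = [0.05, 0.06, 0.05, 0.05, 0.40]
    print(find_spikes(history))
```

In a real training run the per-step values would come from the model's logits averaged over a batch; here they are hand-written numbers chosen to show a sudden jump.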

The Fix

The fix is to stop training purely by trial and error and weave in supervision. The researchers tested several kinds of guidance - correct examples, hints, and even deliberately bad examples to learn from - and found that interleaving ordinary supervised learning with the reinforcement learning keeps control tokens in check and training stable. Pure self-directed practice destabilizes; mixing in a teacher who occasionally shows the right way steadies the whole process.
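As a rough picture of what interleaving means in practice, the sketch below alternates reinforcement updates with an occasional supervised update. It is a schematic under stated assumptions: the one-in-four ratio, the function names, and the caller-supplied `rl_step` and `sft_step` callbacks are hypothetical, and the article does not describe the schedule the paper actually uses.

```python
def interleaved_schedule(total_steps, sft_every=4):
    """Yield 'sft' on every `sft_every`-th step and 'rl' otherwise.

    The ratio is an illustrative assumption, not a value from the paper.
    """
    if sft_every < 2:
        raise ValueError("sft_every must be at least 2")
    for step in range(1, total_steps + 1):
        yield "sft" if step % sft_every == 0 else "rl"


def train(total_steps, rl_step, sft_step, sft_every=4):
    """Run the schedule, calling the caller-supplied update functions.

    `rl_step` and `sft_step` stand in for one reinforcement update and
    one supervised update; both are placeholders here.
    """
    counts = {"rl": 0, "sft": 0}
    for kind in interleaved_schedule(total_steps, sft_every):
        if kind == "sft":
            sft_step()
        else:
            rl_step()
        counts[kind] += 1
    return counts


if __name__ == "__main__":
    print(list(interleaved_schedule(8)))
    print(train(8, rl_step=lambda: None, sft_step=lambda: None))
```

The design point is that the supervised steps are spread through the run, so the model is regularly pulled back toward well-formed tool calls instead of being corrected only once at the start.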

Why It Matters

This is a load-bearing problem for the entire agent boom. Every company racing to ship agents that book travel, write and run code, or manage workflows has to train them on exactly these long, multi-step tool-use tasks. Instability in that training is a hidden tax: it wastes expensive compute and produces unreliable agents. A clean diagnosis (the problem is control tokens, not lost capability) plus a concrete remedy (blend in supervision) is the kind of unglamorous result that quietly makes the next generation of agents more dependable. It pairs naturally with the week's other agent-reliability work, including research on rewarding agents without a clear referee.

Caveats

The fix is not free. The authors note that interleaving supervised training with reinforcement learning can hurt performance on out-of-distribution tasks - situations that look different from the training examples. That is a real trade-off: leaning on supervised examples stabilizes training but can also tether the agent to the patterns it was shown, making it less adaptable when the world throws it something genuinely new.

This is also a single study on specific setups, and like much agent-training research it will need replication across more models and tasks before the recipe is treated as settled. Still, naming the precise failure mode is a meaningful step, because you cannot fix what you cannot see.

Originally published on Ground Truth, where every claim is checked against the primary source.
