Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It
2026-06-24 • Computation and Language
Computation and Language, Machine Learning
AI summary
The authors study how reinforcement learning (RL) can help large language models (LLMs) use tools better, and find that RL alone often causes unstable training and sudden performance crashes. They trace these failures to unexpected probability spikes in specific control tokens, which break the model's tool-invocation format even though its underlying tool-use ability remains intact. To fix this, they test many types of supervisory signals and training schemes, finding that interleaving supervised fine-tuning with RL improves stability but can hurt performance on out-of-distribution tasks. Their work shows that combining different training signals helps LLMs learn complex, multi-step tool-use tasks more reliably.
Large Language Models, Reinforcement Learning, Tool Use, Supervised Fine-Tuning, Catastrophic Collapse, Control Tokens, Off-Policy Supervision, Out-of-Distribution Generalization, Agentic RL, Multi-step Tasks
Authors
Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao
Abstract
Tool use enables large language models (LLMs) to perform complex tasks, and recent agentic reinforcement learning (RL) methods show promise for enhancing model capabilities. However, RL alone often leads to instability or limited gains in tool-use tasks. In our experiments, some models exhibit catastrophic collapse, in which performance drops abruptly and tool-invocation structures fail. Our analysis reveals that these failures stem from unexpected probability spikes in specific control tokens, which disrupt structured execution; the underlying tool-use capability remains intact and is merely obscured by specific formats. To address this, we systematically investigate a diverse set of supervisory signals, including off-policy supervision, hint-based guidance, and erroneous-example supervision, among others, applied under both synchronous and interleaved training schemes. We find that interleaving supervised fine-tuning (SFT) with RL substantially improves stability but degrades performance under format and content out-of-distribution (OOD) evaluation. We also analyze the impact of learning rates and generalization across settings. These results highlight the importance of understanding RL failures and demonstrate how diverse supervisory signals can guide exploratory learning, enabling robust training of LLMs for complex, multi-step tool-use tasks. Our code is available at https://github.com/hypasd-art/Tool-RL-Box.
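As a rough illustration of two ideas named in the abstract, the sketch below shows (a) one possible way to interleave SFT and RL steps in fixed-size blocks and (b) a simple check for a sudden probability spike in a control token across training checkpoints. The function names, the block sizes, the `ratio` threshold, and the `window` length are all hypothetical choices made for illustration; the abstract does not specify the actual schedule or detection criterion the authors use.

```python
from typing import Iterator, List


def interleaved_schedule(total_steps: int,
                         rl_steps_per_cycle: int,
                         sft_steps_per_cycle: int) -> Iterator[str]:
    """Yield "rl" or "sft" for each step, alternating in fixed-size blocks.

    Hypothetical schedule: each cycle runs `rl_steps_per_cycle` RL steps
    followed by `sft_steps_per_cycle` SFT steps.
    """
    if rl_steps_per_cycle <= 0 or sft_steps_per_cycle < 0:
        raise ValueError("need at least one RL step and a non-negative SFT count")
    cycle = rl_steps_per_cycle + sft_steps_per_cycle
    for step in range(total_steps):
        yield "rl" if step % cycle < rl_steps_per_cycle else "sft"


def detect_spikes(probs: List[float],
                  ratio: float = 5.0,
                  window: int = 3) -> List[int]:
    """Return checkpoint indices where a token's probability exceeds
    `ratio` times the mean of the preceding `window` checkpoints.

    `probs` is assumed to hold the average probability assigned to one
    control token at successive checkpoints (an illustrative metric).
    """
    spikes = []
    for i in range(window, len(probs)):
        baseline = sum(probs[i - window:i]) / window
        if baseline > 0 and probs[i] > ratio * baseline:
            spikes.append(i)
    return spikes


if __name__ == "__main__":
    # Two RL steps, then one SFT step, repeated.
    print(list(interleaved_schedule(6, 2, 1)))
    # A single jump at index 3 relative to the previous three checkpoints.
    print(detect_spikes([0.01, 0.01, 0.01, 0.2, 0.01]))
```

In a real training loop, the schedule's output would select which loss to compute at each step, and the spike check could serve as an early warning before tool-invocation formatting breaks down.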