Internalizing Agency from Reflective Experience
2026-03-17 • Artificial Intelligence
AI summary
The authors observe that prevailing post-training methods for language models optimize mainly for final success, leaving the rich feedback produced during a task underutilized, which limits improvement on complex, long-horizon problems. They introduce LEAFE, a framework in which the model summarizes environment feedback into actionable experience, backtracks to earlier decision points, and retries with revised actions; these experience-guided corrections are then distilled back into the model. This makes the policy better at recovering from errors and improves overall problem-solving ability in long tasks. Experiments show that LEAFE outperforms outcome-driven and experience-based baselines on coding and agentic tasks, improving Pass@128 by up to 14%.
large language models, reinforcement learning, long-horizon planning, policy fine-tuning, feedback grounding, agentic tasks, Pass@k metric, experience-based learning, distribution sharpening, supervised fine-tuning
Authors
Rui Ge, Yichao Fu, Yuyang Qian, Junda Su, Yiming Zhao, Peng Zhao, Hao Zhang
Abstract
Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings. To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.
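The exploration loop described in the abstract — summarize feedback into experience, backtrack to a failed decision point, branch with a revised action, and keep the correction as supervised fine-tuning data — can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation; the environment, policy, and all function names here are hypothetical.

```python
# Hypothetical sketch of a LEAFE-style explore/backtrack loop.
# ToyEnv, explore_with_backtracking, and all names are illustrative
# assumptions, not the authors' actual code.
from dataclasses import dataclass

@dataclass
class Step:
    state: str
    action: str

@dataclass
class ToyEnv:
    """Toy environment: the agent must match a hidden action per step,
    and gets rich per-step feedback rather than only a final reward."""
    solution: list

    def feedback(self, t, action):
        if action == self.solution[t]:
            return None  # step succeeded
        return f"step {t}: action {action!r} failed; hint from environment"

def explore_with_backtracking(env, policy, alternatives, max_steps):
    """Roll out the policy; on failure, summarize feedback into experience,
    backtrack to the failed decision point, and explore alternative branches.
    Successful corrections are collected as SFT data."""
    trajectory, sft_data, experience = [], [], []
    t = 0
    while t < max_steps:
        state = f"s{t}|exp={';'.join(experience)}"
        action = policy(t, experience)
        err = env.feedback(t, action)
        if err is None:
            trajectory.append(Step(state, action))
            t += 1
            continue
        # 1) Summarize environment feedback into actionable experience.
        experience.append(err)
        # 2) Backtrack to the decision point and try alternative branches.
        for alt in alternatives(t, action):
            if env.feedback(t, alt) is None:
                # 3) Keep the experience-guided correction for distillation.
                sft_data.append(Step(state, alt))
                trajectory.append(Step(state, alt))
                t += 1
                break
        else:
            break  # no branch recovered; abandon this rollout
    return trajectory, sft_data

# Usage: a naive policy that always picks "a" recovers via backtracking.
env = ToyEnv(solution=["b", "a", "c"])
policy = lambda t, exp: "a"
alts = lambda t, bad: [c for c in "abc" if c != bad]
traj, sft = explore_with_backtracking(env, policy, alts, max_steps=3)
```

In the paper's framing, `sft_data` would then be used to fine-tune the policy so that recovery behavior is internalized for future interactions, rather than relying on explicit backtracking at deployment time.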