StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
2026-05-07 • Computation and Language • Artificial Intelligence
AI summary
The authors developed a method called Strategic Trajectory Abstraction (StraTA) to help AI agents plan better over long tasks by following an overall strategy instead of just reacting step by step. StraTA samples a concise plan from the initial task state, then conditions every action on that plan, training strategy generation and action execution together. Tests on three environments (ALFWorld, WebShop, and SciWorld) showed that StraTA learns faster and performs better than existing methods, reaching high success rates. This suggests that adding a strategy level helps AI agents make more effective long-term decisions.
Large Language Models · Reinforcement Learning · Trajectory · Hierarchical RL · Strategy Abstraction · Sample Efficiency · Credit Assignment · Exploration · GRPO Rollout · Agentic Decision Making
Authors
Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin
Abstract
Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.
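The abstract describes the rollout design in prose only; the Python sketch below is one plausible reading of it, not the authors' released implementation. Everything here is an illustrative assumption: `policy` stands in for an LLM call, `env` for a text environment such as ALFWorld with `reset`/`step` methods, and the prompt formats are invented. The diverse strategy rollout and critical self-judgment enhancements mentioned in the abstract are omitted.

```python
import statistics
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    strategy: str
    steps: list = field(default_factory=list)  # (observation, action) pairs
    reward: float = 0.0
    advantage: float = 0.0

def rollout_group(policy, env, task, group_size=8, max_steps=30):
    """Hierarchical GRPO-style rollout sketch: one strategy per trajectory,
    sampled from the initial state, with all actions conditioned on it."""
    group = []
    for _ in range(group_size):
        obs = env.reset(task)
        # High level: sample a compact trajectory-level strategy once,
        # from the initial task state.
        strategy = policy(f"Task: {task}\nInitial state: {obs}\n"
                          f"Propose a brief strategy:")
        traj = Trajectory(strategy=strategy)
        for _ in range(max_steps):
            # Low level: each action is conditioned on the fixed strategy.
            action = policy(f"Strategy: {strategy}\nObservation: {obs}\n"
                            f"Next action:")
            obs, reward, done = env.step(action)  # assumed env interface
            traj.steps.append((obs, action))
            traj.reward = reward
            if done:
                break
        group.append(traj)

    # Group-relative advantage (GRPO): normalize rewards within the group
    # and assign each trajectory's advantage to both its strategy tokens
    # and its action tokens, so the two levels are trained jointly.
    mean = statistics.mean(t.reward for t in group)
    std = statistics.pstdev(t.reward for t in group) or 1.0
    for t in group:
        t.advantage = (t.reward - mean) / std
    return group
```

In this reading, sampling the strategy once per trajectory, rather than re-planning at every step, is what lets the group-relative signal credit the initial plan as well as the individual actions, matching the abstract's framing of improved credit assignment over extended trajectories.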