Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

2026-06-03
Computation and Language

AI summary

The authors created a new test called the Agent Planning Benchmark (APB) to measure how well AI agents plan tasks before carrying them out. Rather than only checking whether the final task was completed, the test examines individual parts of planning, such as breaking down goals, choosing tools, and recognizing when a task can’t be done. They found common problems in current AI models, including difficulty with long tasks, with distracting or broken tools, and with knowing when to refuse. On tasks drawn from two existing execution benchmarks, refining plans with APB's guidance improved both the plans themselves and the results of executing them. Overall, APB helps separate planning problems from execution problems.

large language models (LLMs), AI planning, multimodal tasks, tool use in AI, benchmarking, task decomposition, constraint reasoning, plan refinement
Authors
Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, Yang Yang, Yu Cheng
Abstract
Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce the Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 τ²-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks.
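
To make the planning-versus-execution distinction concrete, here is a minimal Python sketch of a plan-level check that never runs a tool. It is not APB's actual data format or scoring rule: the names `Setting`, `Case`, `Plan`, and `plan_is_correct`, the exact-match criterion, and the field layout are all assumptions chosen for illustration. Only the five setting categories come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class Setting(Enum):
    # The five settings named in the abstract; identifiers are illustrative.
    HOLISTIC = "holistic"
    STEPWISE = "stepwise_feedback"
    EXTRANEOUS_TOOLS = "extraneous_tools"
    BROKEN_TOOLS = "broken_tools"
    UNSOLVABLE = "unsolvable"


@dataclass
class Case:
    setting: Setting
    reference_tools: list[str]                  # hypothetical gold tool sequence
    broken_tools: frozenset[str] = frozenset()  # tools known to fail (assumed field)


@dataclass
class Plan:
    tools: list[str]
    refused: bool = False


def plan_is_correct(case: Case, plan: Plan) -> bool:
    """Toy plan-level check: judges the plan alone, with no tool execution."""
    if case.setting is Setting.UNSOLVABLE:
        # The only correct plan for an unsolvable task is a refusal.
        return plan.refused
    if plan.refused:
        # Refusing a solvable task counts as a planning failure.
        return False
    if any(tool in case.broken_tools for tool in plan.tools):
        # Planning around a broken tool is required.
        return False
    # Assumed criterion: exact match against the reference tool sequence.
    return plan.tools == case.reference_tools
```

Under these assumptions, a refusal on an unsolvable case counts as correct, while a plan that routes through a broken tool is marked as a planning failure before any execution takes place.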