Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
2026-05-27 • Computation and Language
AI summary
The authors explain that vision-language models struggle with real-world problems that require external tools rather than internal reasoning alone. They identify a problem called the Thinking-Acting Gap: models rarely attempt tool use and often fail when they do, which weakens the learning signal. To address this, they propose AXPO, a method that improves tool use by retrying unsuccessful attempts and choosing the starting points for those retries based on uncertainty. Their experiments show that AXPO improves performance across a range of tasks, and that a smaller model trained with AXPO can outperform a much larger one.
vision-language models, agentic reasoning, Thinking-Acting Gap, tool use, reinforcement learning, AXPO, uncertainty-based selection, multimodal benchmarks, pass@k metric, policy optimization
Authors
Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, Byung-Kwan Lee
Abstract
Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools, since internal reasoning alone often cannot resolve them. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary action). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with one quarter of the parameters.
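The resampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Rollout` fields, the `resample` callable, the `n_resamples` parameter, and the rule of picking the most uncertain prefix are all assumptions made for illustration, since the abstract does not specify how uncertainty is measured or in which direction it is used.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    """One sampled trajectory for a question (hypothetical structure)."""
    thinking_prefix: str             # reasoning text before the first tool call
    used_tool: bool                  # whether the rollout attempted a tool call
    reward: float                    # 1.0 if the final answer is correct, else 0.0
    prefix_uncertainty: float = 0.0  # assumed score, e.g. mean token entropy


def all_wrong_tool_subgroup(group: List[Rollout]) -> List[Rollout]:
    """Return the tool-using rollouts if all of them failed, else an empty list."""
    tool_rollouts = [r for r in group if r.used_tool]
    if tool_rollouts and all(r.reward == 0.0 for r in tool_rollouts):
        return tool_rollouts
    return []


def select_prefix(subgroup: List[Rollout]) -> Rollout:
    """Pick one rollout whose thinking prefix will be reused.

    Assumption: choose the prefix with the highest uncertainty score.
    """
    return max(subgroup, key=lambda r: r.prefix_uncertainty)


def axpo_augment(
    group: List[Rollout],
    resample: Callable[[str], Rollout],
    n_resamples: int = 4,
) -> List[Rollout]:
    """Add resampled rollouts when the tool-using subgroup is all wrong.

    `resample` is a hypothetical callable that keeps the given thinking
    prefix fixed and samples a new tool call plus its continuation.
    """
    subgroup = all_wrong_tool_subgroup(group)
    if not subgroup:
        return list(group)  # nothing to repair for this question
    anchor = select_prefix(subgroup)
    extra = [resample(anchor.thinking_prefix) for _ in range(n_resamples)]
    return list(group) + extra
```

In this sketch the augmented group would then feed the usual group-relative update; how the resampled rollouts are weighted against the original ones is left open here because the abstract does not say.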