Empowering GUI Agents via Autonomous Experience Exploration and Hindsight Experience Utilization for Task Planning

2026-06-25 · Computation and Language

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
AI summary

The authors studied small multimodal language models (MLLMs) that assist with repetitive web tasks and found that these models struggle with planning and with generalizing across websites. They created a method called PEEU that lets the model explore environments and learn from hindsight experience to improve high-level planning. They also developed a framework, TDHAF, for analyzing tasks at three levels of granularity and found that mastering simple skills does not guarantee that the model can plan well overall. Their 7B model reached 30.6% accuracy on real-world web benchmarks, outperforming the much larger Qwen2.5-VL-32B, which suggests that training on high-level tasks and past experience is important.

Multimodal web agents, Small multimodal language models (MLLMs), Task planning, Cross-website generalization, Planning experience exploration and utilization (PEEU), Hindsight experience, Task decomposition hierarchical analysis framework (TDHAF), Compositional generalization, Out-of-distribution (OOD) generalization, GUI tasks
Authors
Tianyi Men, Zhuoran Jin, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Abstract
Multimodal web agents can assist humans in performing repetitive GUI tasks, where effective task planning is essential for decomposing complex tasks into executable actions. While small open-source MLLMs are cost-efficient and privacy-preserving compared with large commercial models, they suffer from weak planning and limited cross-website generalization. To address these limitations, we introduce the planning experience exploration and utilization (PEEU) method, which autonomously explores environments to discover experiences and uses hindsight experience to synthesize strictly aligned, high-level training data. To quantitatively analyze the generalization behaviors driving this performance, we propose the task decomposition hierarchical analysis framework (TDHAF), which systematically studies compositional generalization across three task granularities: low, middle, and high levels. Our analysis reveals that mastering low-level atomic skills does not guarantee high-level planning competence, while high-level task training yields stronger out-of-distribution (OOD) generalization. Experiments on real-world benchmarks demonstrate PEEU's effectiveness: our 7B model achieves 30.6% accuracy, outperforming the much larger Qwen2.5-VL-32B model. These results demonstrate that constructing hindsight high-level tasks and leveraging experiences are crucial for the OOD planning abilities of small MLLMs.
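To make the idea of hindsight experience more concrete, the following is a minimal Python sketch of generic hindsight relabeling, not the authors' implementation. It assumes an exploration trajectory is a list of GUI steps and that a high-level task description is written after the fact from the steps that were actually executed, so that the task and its plan are aligned by construction. The names `Step`, `TrainingExample`, `describe`, `hindsight_relabel`, and `toy_summarizer` are hypothetical; in practice the summarizer would presumably be a model call rather than a string template.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """One explored GUI action (hypothetical schema, not from the paper)."""
    action: str      # e.g. "click" or "type"
    target: str      # e.g. "search box"
    value: str = ""  # text entered, if any


@dataclass(frozen=True)
class TrainingExample:
    """A high-level task paired with the plan that actually achieved it."""
    task: str
    plan: List[str]


def describe(step: Step) -> str:
    """Render a single step as a short plan line."""
    if step.value:
        return f"{step.action} '{step.value}' into {step.target}"
    return f"{step.action} {step.target}"


def hindsight_relabel(
    trajectory: List[Step],
    summarize: Callable[[List[str]], str],
) -> TrainingExample:
    """Build a training example from an already-executed trajectory.

    The task text is derived from the executed steps (in hindsight),
    so the instruction cannot disagree with the plan it is paired with.
    """
    plan = [describe(step) for step in trajectory]
    return TrainingExample(task=summarize(plan), plan=plan)


def toy_summarizer(plan: List[str]) -> str:
    """Stand-in for a model-written task description (assumption)."""
    return "Complete a task that requires: " + "; then ".join(plan)


if __name__ == "__main__":
    explored = [
        Step("click", "search box"),
        Step("type", "search box", "laptop"),
        Step("click", "search button"),
    ]
    example = hindsight_relabel(explored, toy_summarizer)
    print(example.task)
    for line in example.plan:
        print(" -", line)
```

The design choice illustrated here is that the label is generated from the trajectory rather than the trajectory from the label, which is one plausible reading of "strictly aligned" in the abstract.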