MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
2026-04-16 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Artificial Intelligence • Computation and Language
AI summary
The authors present MM-WebAgent, a system that builds webpages by coordinating different AI generation tools so that elements such as images, videos, and visualizations look consistent and fit the overall layout. Rather than generating each element in isolation, which tends to produce mismatched styles and an incoherent page, the method plans the page hierarchically and iteratively reflects on its own output to improve the design. The authors also introduce a benchmark and a multi-level evaluation protocol for assessing multimodal webpage generation. Their experiments show that MM-WebAgent outperforms code-generation and agent-based baselines, especially on generating and integrating multimodal elements.
Artificial Intelligence Generated Content (AIGC), webpage generation, multimodal content, hierarchical planning, self-reflection, UI/UX design, layout optimization, benchmark, evaluation protocol
Authors
Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao, Weiwei Guo, Lili Qiu, Mingxi Cheng, Qi Dai, Zhendong Wang, Zhengyuan Yang, Xue Yang, Ji Li, Lijuan Wang, Chong Luo
Abstract
The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.