Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification
2026-03-27 • Software Engineering
Software Engineering · Artificial Intelligence
AI summary
The authors created Vision2Web, a large benchmark that tests how well AI models can build websites from images and designs. It checks skills at three levels: turning a static design into a page, reproducing interactive multi-page sites, and building full websites with both frontend and backend. The benchmark is built from real-world website data, with 193 tasks, 918 prototype images, and 1,255 test cases. To check the results, the authors use two kinds of automated helpers: a GUI agent that interacts with the built site, and a vision-language model that judges what it looks like. When they tested today's top models, even the best ones still struggled with the complex full-stack website tasks.
large language models · coding agents · UI-to-code generation · frontend development · full-stack development · benchmark · visual language models · agent verification · workflow-based evaluation · multi-page websites
Authors
Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang
Abstract
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.
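The two-part verification the abstract describes could be combined roughly as sketched below. This is a hypothetical illustration, not the paper's implementation: the names (`TaskResult`, `task_score`), the pass-rate formula, and the equal weighting are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Verification signals for one benchmark task (field names are illustrative)."""
    gui_tests_passed: int   # test cases the GUI agent verifier confirmed
    gui_tests_total: int    # test cases defined for the task
    vlm_judge_score: float  # VLM-based judge's score in [0, 1]

def task_score(r: TaskResult, w_gui: float = 0.5, w_vlm: float = 0.5) -> float:
    """Blend both verifiers into one score; the 50/50 weighting is an assumption."""
    gui_rate = r.gui_tests_passed / r.gui_tests_total if r.gui_tests_total else 0.0
    return w_gui * gui_rate + w_vlm * r.vlm_judge_score

# Example: 4 of 5 interaction tests pass, the VLM judge rates the page 0.7.
print(round(task_score(TaskResult(4, 5, 0.7)), 2))  # 0.75
```

The point of the sketch is that the two signals are complementary: the GUI agent checks functional behavior by executing test cases, while the VLM judge assesses visual fidelity that scripted tests cannot capture.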