ProductWebGen: Benchmarking Multimodal Product Webpage Generation

2026-05-31, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors focus on automatically creating product display webpages from a product image and instructions about layout and content, which is useful for marketing and e-commerce. They introduce ProductWebGen, a benchmark dataset with 500 samples to test how well models can generate webpages with consistent images and follow instructions. The authors compare two approaches: one using separate language and image editing models, and another using a single unified multimodal model. They find that the editing-based approach is better at following webpage instructions and producing appealing content, while the unified model does better at following visual content instructions. They also build a larger fine-tuning dataset called ProductWebGen-1k and demonstrate it helps improve open-source models.

multimodal generative models, image editing, unified models, HTML code generation, product display webpage, benchmark dataset, fine-tuning, large language models, instruction following, e-commerce
Authors
Zhihong Liu, Siqi Kou, Zheng Li, Ye Ma, Quan Chen, Peng Jiang, Kai Yu, Zhijie Deng
Abstract
Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and e-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models (UMs). To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation: one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at https://github.com/SJTU-DENG-Lab/ProductWebGen.
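The sample structure and the two workflows described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Sample`, `Page`, `editing_based`, and `um_based` are hypothetical, the models are passed in as plain callables, and the order in which the UM emits HTML and images is assumed rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One benchmark sample: the three inputs named in the abstract."""
    source_image: str          # path or ID of the source product image
    visual_instruction: str    # visual content instruction
    webpage_instruction: str   # webpage instruction


@dataclass
class Page:
    """Generated output: HTML code plus the images it displays."""
    html: str
    images: List[str]


def editing_based(sample: Sample,
                  llm: Callable[[str, int], str],
                  editor: Callable[[str, str, int], str],
                  n_images: int) -> Page:
    """Editing-based workflow: HTML and images come from separate models."""
    # Hypothetical LLM call: writes HTML with n_images image slots.
    html = llm(sample.webpage_instruction, n_images)
    # Hypothetical editing call: each image is derived from the source image.
    images = [editor(sample.source_image, sample.visual_instruction, i)
              for i in range(n_images)]
    return Page(html, images)


def um_based(sample: Sample,
             um: Callable[[list, str], str],
             n_images: int) -> Page:
    """UM-based workflow: one model generates both HTML and images."""
    # The context starts with the three inputs and grows with every output.
    context = [sample.source_image,
               sample.visual_instruction,
               sample.webpage_instruction]
    html = um(context, "html")   # assumed order: HTML first, then images
    context.append(html)
    images = []
    for _ in range(n_images):
        # Each image is conditioned on all preceding multimodal context.
        image = um(context, "image")
        images.append(image)
        context.append(image)
    return Page(html, images)
```

The difference the sketch highlights is where conditioning happens: in `editing_based` each image depends only on the source image and the visual content instruction, whereas in `um_based` each image also sees the HTML and the images generated before it.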