UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
2026-03-03 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors created a new benchmark called UniG2U-Bench to test when generating images actually helps AI models understand visual tasks better. They evaluated over 30 models and found that generating an image before answering usually hurts performance compared with answering directly. However, generation does help on specific tasks involving spatial reasoning, visual illusions, or multi-step visual problems. They also noticed that models sharing similar architectures behave similarly across tasks, suggesting that training data and design choices shape these abilities in consistent ways. Their work points to the need for more diverse data and new methods to improve models that work with both images and text.
Unified multimodal models · Generation-to-understanding (G2U) · Vision-Language Models (VLMs) · Generate-then-Answer (GtA) · Spatial intelligence · Visual illusions · Multi-round reasoning · Inductive biases · Pretraining data · Multimodal benchmarks
Authors
Zimo Wen, Boxiu Li, Wanbo Zhang, Junxiang Lei, Xiaoyu Chen, Yijia Fan, Qi Zhang, Yujiang Wang, Lili Qiu, Bo Li, Ziwei Liu, Caihua Shan, Yifan Yang, Yifei Shen
Abstract
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark that categorizes generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks requiring varying degrees of implicit or explicit visual transformation. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent gains emerge in spatial intelligence, visual illusion, and multi-round reasoning subtasks, where improved spatial and shape perception, as well as multi-step intermediate image states, prove beneficial. 3) Tasks with similar reasoning structures and models sharing architectures exhibit correlated behaviors, suggesting that generation-understanding coupling induces class-consistent inductive biases over tasks, pretraining data, and model architectures. These findings highlight the necessity of more diverse training data and novel paradigms to fully unlock the potential of unified multimodal modeling.
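For concreteness, here is a minimal sketch of the two inference modes the abstract contrasts: direct inference versus Generate-then-Answer (GtA). The `model` interface and function names below are hypothetical illustrations, not the paper's actual evaluation harness.

```python
# Hypothetical sketch of direct inference vs. Generate-then-Answer (GtA).
# Assumes `model` exposes two calls: `answer(images, question)` returning
# text, and `generate(images, prompt)` returning a new image. Neither
# name comes from the paper; both are placeholders.

def direct_inference(model, images, question):
    # Direct inference: the model answers from the input image(s) alone.
    return model.answer(images, question)

def generate_then_answer(model, images, question, rounds=1):
    # GtA: the unified model first synthesizes intermediate image(s)
    # intended to expose the needed visual transformation, then answers
    # conditioned on the original and generated images together.
    # `rounds > 1` corresponds to the multi-round reasoning setting,
    # where each step produces a new intermediate image state.
    context = list(images)
    for _ in range(rounds):
        intermediate = model.generate(
            context, f"Visualize an intermediate step toward answering: {question}"
        )
        context.append(intermediate)
    return model.answer(context, question)
```

The benchmark's finding is that `generate_then_answer` typically scores below `direct_inference` overall, with the exceptions noted in finding 2.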