Enhancing Spatial Understanding in Image Generation via Reward Modeling

2026-02-27
Computer Vision and Pattern Recognition

AI summary

The authors created a large dataset called SpatialReward-Dataset to help computers understand spatial relationships better in text-to-image generation. They used this data to build SpatialScore, a model that evaluates how well generated images match the described spatial arrangements. SpatialScore performed better than existing models at judging spatial accuracy. The authors also showed that their model can be used in reinforcement learning to improve image generation with complex spatial details. Their experiments demonstrated consistent improvements in how well images reflect spatial relationships described in text.

text-to-image generation, spatial relationships, reward model, reinforcement learning, preference pairs, SpatialReward-Dataset, SpatialScore, model evaluation, spatial accuracy, benchmark
Authors
Zhenyu Tang, Chaoran Feng, Yufan Deng, Jie Wu, Xiaojie Li, Rui Wang, Yunpeng Chen, Daquan Zhou
Abstract
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset, which contains over 80k preference pairs. Building on this dataset, we develop SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation; it achieves performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
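
The abstract does not state how SpatialScore is trained or evaluated on its preference pairs. As a minimal sketch of one common approach, assuming a Bradley-Terry style pairwise objective (an assumption, not the authors' confirmed method), the following shows how a scalar reward for a preferred and a rejected image can be turned into a training loss and a pairwise accuracy. The function names and the scores are hypothetical and chosen only for illustration.

```python
import math


def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred image outranks the rejected one.

    Assumption: a Bradley-Terry objective, -log(sigmoid(s_pref - s_rej)).
    The abstract does not confirm which loss SpatialScore uses.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_accuracy(pairs):
    """Fraction of pairs in which the preferred image gets the strictly higher score."""
    if not pairs:
        raise ValueError("pairs must be non-empty")
    correct = sum(1 for s_pref, s_rej in pairs if s_pref > s_rej)
    return correct / len(pairs)


# Hypothetical reward scores as (preferred, rejected); not real SpatialScore outputs.
pairs = [(2.0, 0.5), (1.2, 1.5), (0.3, -0.7), (0.9, 0.1)]

mean_loss = sum(bradley_terry_loss(p, r) for p, r in pairs) / len(pairs)
accuracy = pairwise_accuracy(pairs)

print(f"mean pairwise loss: {mean_loss:.4f}")
print(f"pairwise accuracy:  {accuracy:.2f}")
```

Under this assumption, the loss shrinks as the reward model scores the preferred image further above the rejected one, and pairwise accuracy is a natural way to compare a reward model against other judges on held-out preference pairs.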