Generative World Renderer
2026-04-02 • Computer Vision and Pattern Recognition
AI summary
The authors created a huge new video dataset by capturing realistic scenes from high-quality video games to help improve computer methods that analyze and generate images and videos. Their dataset has millions of frames showing detailed information like lighting, materials, and motion, making it easier to understand and recreate complex visuals. They also developed a way to test how well these methods work in real-world scenarios without needing exact ground truth data. Their experiments show that training with this dataset makes the methods better at handling different environments and allows users to edit game visuals using simple text descriptions.
Keywords
generative inverse rendering, forward rendering, G-buffer, temporal coherence, semantic consistency, video game graphics, dataset, cross-dataset generalization, visual language model (VLM), material decomposition
Authors
Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang
Abstract
Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: it enables robust in-the-wild geometry and material decomposition, and facilitates high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit the visual style of AAA games from G-buffers using text prompts.
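To make the data layout described above concrete, the sketch below builds a dummy per-frame sample pairing a 720p RGB image with five G-buffer channels. The channel names and shapes are illustrative assumptions (the abstract does not enumerate them); this is a minimal mock, not the paper's loader.

```python
import numpy as np

# Hypothetical names for the five G-buffer channels; the paper only states
# that there are five synchronized channels, so these are assumptions.
GBUFFER_CHANNELS = ["albedo", "normal", "depth", "roughness", "irradiance"]

def make_frame(height=720, width=1280, seed=0):
    """Build a dummy frame dict mimicking synchronized RGB + G-buffers."""
    rng = np.random.default_rng(seed)
    frame = {"rgb": rng.random((height, width, 3), dtype=np.float32)}
    for name in GBUFFER_CHANNELS:
        # Assume 3-channel maps for albedo/normal, scalar maps otherwise.
        channels = 3 if name in ("albedo", "normal") else 1
        frame[name] = rng.random((height, width, channels), dtype=np.float32)
    return frame

frame = make_frame()
```

A forward renderer conditioned on such a sample would consume every key except `"rgb"`, which serves as the reconstruction target; an inverse renderer does the reverse.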