WorldOlympiad: Can Your World Model Survive a Triathlon?

2026-06-09

Computer Vision and Pattern Recognition
AI summary

The authors created WorldOlympiad, a benchmark that tests whether videos generated by world models make sense physically, geometrically, and in terms of interactions over time. Unlike benchmarks that focus mainly on how videos look, it checks whether videos follow real-world rules, keep 3D structure consistent, and handle complex actions smoothly. It has three tracks: one checks physical rules, another examines 3D structure and camera movement, and the last tests whether the actions in a video follow the instructions given. Experiments show that current state-of-the-art models struggle on all three, which points to the need for more structured evaluation of generative world models.

world models, video generation, physical reasoning, geometric consistency, interaction fidelity, object segmentation, Gaussian splatting, multimodal large language model, camera trajectory, long-horizon prediction
Authors
Yuke Zhao, Wangbo Zhao, Weijie Wang, Zeyu Zhang, Dakai An, Akide Liu, Yinghao Yu, Jiasheng Tang, Fan Wang, Wei Wang, Bohan Zhuang
Abstract
We introduce WorldOlympiad, a benchmark for diagnosing video-based world models across physical faithfulness, geometric consistency, and interaction fidelity. While existing benchmarks often focus on visual quality, semantic alignment, or short-term temporal coherence, they provide limited insight into whether generated videos obey physical rules, preserve coherent 3D structure, and sustain controllable interactions over long horizons. To address this gap, WorldOlympiad decomposes world-model evaluation into three complementary dimensions. The physical track uses object segmentation and MLLM-as-judge to assess whether generated videos follow interpretable rules in mechanics, thermal phenomena, and material properties. The geometry track reconstructs generated videos with Gaussian splatting and evaluates structural consistency, cross-view coherence, and camera-trajectory alignment. The interaction track assesses whether generated rollouts follow complex action prompts and maintain smooth, coherent transitions across consecutive video chunks. WorldOlympiad further covers three major downstream scenarios, including gaming, robotics, and general real-world videos, capturing diverse challenges from interactive control and embodied manipulation to open-domain motion and camera dynamics. Together, these tracks and scenarios form a scalable and interpretable evaluation suite that exposes failure modes beyond generic video quality. Experiments on state-of-the-art models reveal substantial gaps in physical reasoning, 3D consistency, and long-horizon interaction, underscoring the need for more structured evaluation protocols for generative world models.