World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

2026-04-27 • Computer Vision and Pattern Recognition

AI summary

The authors present World-R1, a method that helps video generation models produce more 3D-consistent output without changing the model's architecture. They use reinforcement learning (Flow-GRPO), with feedback from pre-trained 3D foundation models and vision-language models, to steer generation toward better 3D structure. They also introduce a text-only dataset tailored for world simulation, and a periodic decoupled training strategy that balances rigid geometric consistency against natural scene dynamics. Their evaluations indicate that the method improves 3D consistency while preserving the visual quality of the base model.

video foundation models, 3D consistency, reinforcement learning, world simulation, Flow-GRPO, vision-language models, structural coherence, decoupled training

Authors

Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang

Abstract

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
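
To make the optimization step more concrete, here is a minimal sketch of the group-relative advantage computation that GRPO-style methods rely on: several videos are sampled for one prompt, each is scored, and the scores are normalized within the group. The abstract does not specify how the 3D and vision-language feedback are combined, so the weighted sum in `combined_reward`, its weights, and the placeholder scores are all assumptions made for illustration and are not the authors' reward design.

```python
import statistics


def combined_reward(geometry_score, vlm_score, w_geometry=0.5, w_vlm=0.5):
    """Hypothetical reward: weighted sum of a 3D-consistency score and a
    vision-language score. The weights are illustrative assumptions."""
    return w_geometry * geometry_score + w_vlm * vlm_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn from the same
    prompt: subtract the group mean and divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder scores for four videos sampled from a single prompt.
# In practice these would come from a 3D foundation model and a VLM.
geometry_scores = [0.9, 0.4, 0.7, 0.2]
vlm_scores = [0.8, 0.6, 0.5, 0.3]

rewards = [combined_reward(g, v) for g, v in zip(geometry_scores, vlm_scores)]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.2f}")
```

Samples that score above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged. Because the signal is relative to the group, no separate learned value model is needed for the baseline.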
