MultiWorld: Scalable Multi-Agent Multi-View Video World Models

2026-04-20
Computer Vision and Pattern Recognition

AI summary

The authors developed MultiWorld, a system that can predict future video frames showing multiple agents acting and interacting from different camera views at once. Unlike previous models that handled only one agent, MultiWorld uses new components to control several agents precisely and keep observations consistent across views. It can handle varying numbers of agents and camera angles efficiently and works well in tests with multiplayer games and robots. According to the authors, it improves video quality and better follows actions compared to earlier methods.

video world model, multi-agent systems, multi-view consistency, action-conditioned video generation, video prediction, multi-agent controllability, robot manipulation, environmental dynamics, global state encoding, video fidelity
Authors
Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
Abstract
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
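The abstract's core interface can be sketched as a function from historical frames and per-agent actions to predicted future frames, one output per camera view. The sketch below is purely illustrative: the `DummyWorldModel` class, all tensor shapes, and the placeholder conditioning are assumptions for exposition, not the paper's actual architecture.

```python
import numpy as np

class DummyWorldModel:
    """Toy stand-in for a multi-agent multi-view video world model.

    A real model (per the abstract) would use a Multi-Agent Condition
    Module for agent controllability and a Global State Encoder for
    cross-view consistency; here we just average the history and add a
    scalar action offset to show the input/output contract.
    """

    def step(self, history, actions):
        # history: (num_views, T, H, W, C) past frames per camera view
        # actions: (num_agents, action_dim) current action per agent
        # returns: (num_views, H, W, C) predicted next frame per view
        context = history.mean(axis=1)   # temporal average per view
        offset = actions.mean()          # placeholder action conditioning
        return np.clip(context + offset, 0.0, 1.0)

# Usage: 2 views with 4 past 8x8 RGB frames, 3 agents with 6-dim actions.
model = DummyWorldModel()
history = np.random.rand(2, 4, 8, 8, 3)
actions = np.random.rand(3, 6)
pred = model.step(history, actions)
print(pred.shape)  # (2, 8, 8, 3)
```

Note how the view axis is kept independent throughout, mirroring the claim that views can be synthesized in parallel, while the action conditioning is shared, mirroring the shared global state across views.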