Solaris: Building a Multiplayer Video World Model in Minecraft

2026-02-25
Computer Vision and Pattern Recognition

AI summary

The authors developed Solaris, a new video generation model that can understand and simulate environments where multiple players interact, unlike previous models that handled only a single player. They built a dedicated system to collect large amounts of synchronized video and action data from multiplayer games such as Minecraft. By training Solaris in stages—from single-player data to complex multiplayer scenarios—the authors improved its ability to predict consistent multi-view videos involving multiple agents. Their results show Solaris outperforms earlier models, and they release their tools and data to support future research in multi-agent video modeling.

action-conditioned video generation, multi-agent interaction, multi-view observations, data collection, Minecraft, video world models, training pipeline, Self Forcing, Checkpointed Self Forcing, multi-agent systems
Authors
Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie
Abstract
Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized video and action capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. By open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.
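The abstract describes Checkpointed Self Forcing as a memory-efficient variant that enables a longer-horizon teacher. The paper's actual implementation is not shown here, but the underlying checkpointing idea can be sketched in plain Python: during a long autoregressive rollout, store only every k-th state and recompute the intermediate ones on demand from the nearest stored checkpoint, trading compute for memory. All names below (`step`, `rollout_checkpointed`, `segment`) are illustrative assumptions, not the paper's API, and the toy transition stands in for a neural world-model step.

```python
# Illustrative sketch (assumed, not the paper's code): activation
# checkpointing applied to a long autoregressive rollout.

def step(h, a):
    """Toy world-model transition: next state from (state, action)."""
    return 0.9 * h + a  # stand-in for a neural transition function

def rollout_full(h0, actions):
    """Baseline: keep every intermediate state (O(T) memory)."""
    states = [h0]
    for a in actions:
        states.append(step(states[-1], a))
    return states

def rollout_checkpointed(h0, actions, segment=8):
    """Store only every `segment`-th state (O(T / segment) memory)."""
    ckpts = {0: h0}
    h = h0
    for t, a in enumerate(actions, start=1):
        h = step(h, a)
        if t % segment == 0:
            ckpts[t] = h
    return ckpts

def recompute_state(ckpts, actions, t, segment=8):
    """Recover state t exactly by replaying from the nearest checkpoint."""
    base = (t // segment) * segment
    h = ckpts[base]
    for a in actions[base:t]:
        h = step(h, a)
    return h

actions = [0.1 * i for i in range(64)]
full = rollout_full(0.0, actions)                      # 65 stored states
ckpts = rollout_checkpointed(0.0, actions, segment=8)  # 9 stored states
# Any intermediate state is recoverable bit-exactly from the checkpoints.
assert abs(recompute_state(ckpts, actions, 37) - full[37]) < 1e-12
```

The same trade-off is what makes a longer-horizon teacher feasible: backpropagation through a T-step rollout no longer requires holding all T activations at once, only the checkpoints plus one segment's worth of recomputation.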