ActionParty: Multi-Subject Action Binding in Generative Video Games

2026-04-02
Computer Vision and Pattern Recognition · Artificial Intelligence · Machine Learning
AI summary

The authors address a limitation of video diffusion models, which usually control only one character at a time, by creating ActionParty, a model that can control multiple characters simultaneously in a generated video game environment. They introduce a method to track each character's state separately, helping the model know who is doing what. Tested on the Melting Pot benchmark, their model controlled up to seven characters accurately, improving action-following and identity tracking in complex scenes. This work helps make video-based simulations with many interacting agents more reliable.

video diffusion · world models · multi-agent control · action binding · subject state tokens · spatial biasing · Melting Pot benchmark · autoregressive tracking · generative video games
Authors
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin
Abstract
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
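To make the abstract's core mechanism concrete, the following is a minimal sketch of what jointly attending over per-subject state tokens and video latents with a spatial bias could look like. This is a hypothetical illustration, not the paper's implementation: the module name, the Gaussian-distance bias, and all hyperparameters are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn


def make_spatial_bias(subject_xy, grid_hw, n_heads, strength=2.0):
    # Hypothetical spatial biasing: an additive attention bias that favors
    # pairs of (subject state token, video patch) that are spatially close.
    # subject_xy: (B, S, 2) normalized subject centers in [0, 1]
    B, S, _ = subject_xy.shape
    H, W = grid_hw
    P = H * W
    ys = (torch.arange(H) + 0.5) / H
    xs = (torch.arange(W) + 0.5) / W
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).reshape(P, 2)
    d2 = ((subject_xy[:, :, None, :] - grid[None, None]) ** 2).sum(-1)  # (B, S, P)
    bias = torch.zeros(B, S + P, S + P)
    bias[:, :S, S:] = -strength * d2                  # state -> nearby patches
    bias[:, S:, :S] = -strength * d2.transpose(1, 2)  # patches -> owning state
    return bias.repeat_interleave(n_heads, dim=0)     # (B * n_heads, S+P, S+P)


class SubjectStateAttention(nn.Module):
    # Sketch of one joint-attention step: each subject's persistent state
    # token is concatenated with the frame's patch latents, and the spatial
    # bias is added to the attention logits via attn_mask.
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, video_latents, state_tokens, spatial_bias):
        # video_latents: (B, P, D) patch tokens; state_tokens: (B, S, D)
        S = state_tokens.shape[1]
        x = torch.cat([state_tokens, video_latents], dim=1)
        out, _ = self.attn(x, x, x, attn_mask=spatial_bias)
        return out[:, S:], out[:, :S]  # updated frame latents, updated states


B, S, H, W, D, heads = 2, 3, 4, 4, 32, 4
latents = torch.randn(B, H * W, D)
states = torch.randn(B, S, D)
xy = torch.rand(B, S, 2)
bias = make_spatial_bias(xy, (H, W), heads)
new_latents, new_states = SubjectStateAttention(D, heads)(latents, states, bias)
```

Because the state tokens persist across frames, re-running this step autoregressively would let each subject's latent carry its identity and action history forward, which is the disentanglement the abstract describes between global frame rendering and per-subject updates.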