From Zero to Hero: Training-Free Custom Concept Spawning in World Models

2026-06-01 • Computer Vision and Pattern Recognition

AI summary

The authors identify a limitation of interactive, game-like world models: once a scene starts generating, the user cannot add new objects to it. They propose SPAWN, a training-free method that lets users insert new visual elements, such as characters or buildings, into a video by temporarily swapping the model's pinned reference-frame anchor for the new concept. The model's existing context memory then spreads the concept through the video without retraining. Their experiments show that SPAWN keeps lighting, scale, and perspective consistent and accepts either a concept image or a text description as input.

autoregressive models, world models, concept spawning, video generation, memory injection, reference frame, temporal coherence, interactive storytelling, latent space, context memory

Authors

Kiymet Akdemir, Pinar Yanardag

Abstract

Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the moment the user navigates beyond what is visible in that frame, the unseen regions are populated by the base model's priors, with no mechanism for the user to specify what should appear and where. This is a fundamental limitation for applications such as gaming, interactive storytelling, and simulation, where controllable scene composition is essential. We refer to this missing capability as concept spawning: introducing a user-specified visual concept into a world model, analogous to spawning in a game engine. We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. SPAWN exploits a structural property of image-to-video backbones: the first slot of the context memory is pinned to the reference frame and acts as a foundational anchor for every generated chunk. By swapping this anchor with an external concept latent over a short injection window and then letting the original anchor return, we cause the concept to propagate naturally through the rollout via the model's own memory. SPAWN supports concepts ranging from fine-grained entities such as characters and props to large-scale elements such as buildings and landmarks, and accepts either a concept image or a text description as input. Experiments show that SPAWN integrates concepts with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without any training.
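
The anchor-swap idea in the abstract can be illustrated with a toy rollout. The sketch below is not the authors' implementation: the function names, the one-dimensional "latents", the sliding window of recent chunks, and the averaging stand-in for the world model are all assumptions made for illustration. It only shows the control flow the abstract describes, in which slot 0 of the context memory holds the reference anchor, is replaced by a concept latent for a short injection window, and is then restored.

```python
from collections import deque
from typing import Callable, List, Sequence, Tuple

Latent = List[float]


def toy_generate_chunk(memory: Sequence[Latent]) -> Latent:
    """Hypothetical stand-in for the world model: averages all memory slots."""
    dim = len(memory[0])
    return [sum(slot[i] for slot in memory) / len(memory) for i in range(dim)]


def rollout_with_spawn(
    anchor: Latent,
    concept: Latent,
    num_chunks: int,
    inject_start: int,
    inject_len: int,
    window: int = 3,
    generate_chunk: Callable[[Sequence[Latent]], Latent] = toy_generate_chunk,
) -> Tuple[List[Latent], List[str]]:
    """Autoregressive rollout whose pinned slot 0 is swapped during a window.

    All parameter names are illustrative assumptions, not the paper's API.
    """
    recent: deque = deque(maxlen=window)  # assumed sliding memory of past chunks
    chunks: List[Latent] = []
    pinned_log: List[str] = []

    for step in range(num_chunks):
        # Inside the injection window the concept latent occupies slot 0;
        # outside it, the original reference anchor is pinned there.
        injecting = inject_start <= step < inject_start + inject_len
        pinned = concept if injecting else anchor

        memory = [pinned, *recent]  # slot 0 is the pinned anchor
        chunk = generate_chunk(memory)

        recent.append(chunk)  # generated chunks feed back into memory
        chunks.append(chunk)
        pinned_log.append("concept" if injecting else "anchor")

    return chunks, pinned_log


if __name__ == "__main__":
    chunks, pinned_log = rollout_with_spawn(
        anchor=[0.0], concept=[1.0], num_chunks=6, inject_start=2, inject_len=2
    )
    for step, (chunk, pinned) in enumerate(zip(chunks, pinned_log)):
        print(step, pinned, chunk)
```

In this toy setting, the chunks generated after the original anchor returns still carry a nonzero trace of the concept, because the chunks produced during the injection window remain in the sliding memory. That mirrors, in a very simplified way, the abstract's claim that the concept propagates through the rollout via the model's own memory.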
