AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond

2026-05-25 | Robotics

Robotics, Computer Vision and Pattern Recognition
AI summary

The authors propose AnyScene, a method for generating controllable driving scenes and videos from bird's-eye view (BEV) layouts. A Spatial-Temporal Occupancy Diffusion Transformer jointly tokenizes BEV and occupancy features and generates semantic occupancy sequences autoregressively; occupancy here describes which regions of the 3D scene are filled and by what class of object. A second module, Geometry-Grounded View Expansion, takes the generated occupancy as its spatial reference and synthesizes temporally consistent multi-view driving videos without relying on reference frames. The reported experiments show that AnyScene generalizes to unseen and user-defined layouts and benefits downstream tasks such as sparse-view 3D reconstruction.

occupancy map, bird's-eye view (BEV), diffusion transformer, autonomous driving, video synthesis, spatial-temporal modeling, 3D reconstruction, multi-view video, generative modeling
Authors
Haiming Zhang, Junfei Zhou, Feng Jiang, Jingzhong Li, Zhenglong Guo, Penglin Dai, Jifeng Dai, Yan Xie, Benjin Zhu
Abstract
Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.
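
The autoregressive, long-horizon generation described in the abstract can be pictured as a chunked rollout: each new chunk of occupancy frames is conditioned on the BEV layouts for that chunk plus the most recently generated frames. The Python sketch below illustrates only that control flow and is not the authors' implementation. The function names, tensor shapes, chunk and context lengths, and the number of height bins are all assumptions, and the learned diffusion transformer is replaced by a toy rule that lifts each BEV cell into a column of voxels.

```python
import numpy as np


def toy_generate_chunk(bev_chunk, context_occ, rng, z=4):
    """Toy stand-in for a learned occupancy generator (hypothetical).

    bev_chunk:   (n, H, W) integer BEV class map per frame, 0 = empty
    context_occ: (k, H, W, Z) previously generated occupancy, or None
    returns:     (n, H, W, Z) semantic occupancy for the chunk
    """
    _, h, w = bev_chunk.shape
    if context_occ is None:
        # First chunk: sample a column height for every BEV cell.
        heights = rng.integers(1, z + 1, size=(h, w))
    else:
        # Later chunks: reuse the column heights of the last generated
        # frame so that consecutive chunks stay temporally consistent.
        heights = (context_occ[-1] > 0).sum(axis=-1)
        heights = np.where(heights == 0, 1, heights)
    # A voxel is filled with the BEV class if it lies below the column height.
    below_height = np.arange(z)[None, None, :] < heights[..., None]  # (H, W, Z)
    return bev_chunk[..., None] * below_height[None]


def rollout(bev_layouts, chunk_len=4, context_len=2, seed=0):
    """Generate occupancy for all frames, one chunk at a time."""
    rng = np.random.default_rng(seed)
    generated = []  # list of (H, W, Z) frames produced so far
    for start in range(0, len(bev_layouts), chunk_len):
        bev_chunk = bev_layouts[start:start + chunk_len]
        # Condition on the most recent generated frames, if any exist.
        context = np.stack(generated[-context_len:]) if generated else None
        generated.extend(toy_generate_chunk(bev_chunk, context, rng))
    return np.stack(generated)


# Example: a static 8x8 layout with one 2x2 object, rolled out for 10 frames.
bev = np.zeros((10, 8, 8), dtype=np.int64)
bev[:, 2:4, 3:5] = 1
occ = rollout(bev)
print(occ.shape)
```

Because each chunk depends only on its own BEV layouts and a fixed-size window of earlier output, the horizon can be extended by supplying more layouts, which is the sense in which such a scheme supports long-horizon generation.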