SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
2026-04-09 • Computer Vision and Pattern Recognition
AI summary
The authors created a huge new video dataset called SceneScribe-1M that includes one million real-world videos with detailed text descriptions, camera details, depth maps, and 3D tracking information. This dataset helps computers learn both to understand 3D scenes in videos and to create new videos from text. They tested the dataset on tasks like figuring out depth from a single image, rebuilding scenes in 3D, tracking moving points, and making videos from text prompts. By sharing SceneScribe-1M publicly, the authors want to help researchers build better systems that can both see and make realistic 3D video content.
3D geometric perception, video synthesis, depth estimation, scene reconstruction, point tracking, text-to-video synthesis, camera parameters, multi-modal dataset, spatio-temporal information
Authors
Yunnan Wang, Kecheng Zheng, Jianyuan Wang, Minghao Chen, David Novotny, Christian Rupprecht, Yinghao Xu, Xing Zhu, Wenjun Zeng, Xin Jin, Yujun Shen
Abstract
The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this gap, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.
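To illustrate how the geometric annotations listed in the abstract relate to one another, the following is a minimal sketch of lifting a dense depth map to 3D points using camera parameters. It is not the dataset's loader or the authors' pipeline: the abstract does not specify file formats or coordinate conventions, so the function name `unproject_depth`, the pinhole camera model, the z-depth convention, and the camera-to-world pose layout are all assumptions made for illustration.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map to world-space 3D points (hypothetical helper).

    Assumptions (not taken from the paper): pinhole intrinsics K (3x3),
    depth measured along the camera's optical axis (z-depth), and a 4x4
    camera-to-world pose matrix.
    Returns an (H, W, 3) array of world-space points.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows; both have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project each pixel to a camera-space ray with z = 1.
    rays = pixels @ np.linalg.inv(K).T
    # Scale each ray by its depth to get camera-space points.
    points_cam = rays * depth[..., None]
    # Apply the rigid camera-to-world transform.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t


# Toy example: a flat surface 2 units in front of an identity-pose camera.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 5), 2.0)
points = unproject_depth(depth, K, np.eye(4))
```

Under these assumptions, the pixel at the principal point maps to a point directly on the optical axis at the given depth. Points lifted this way from consecutive frames could then be compared against the 3D point tracks, provided the dataset uses compatible conventions.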