Scene-Centric Unsupervised Video Panoptic Segmentation
2026-06-03 • Computer Vision and Pattern Recognition
AI summary
The authors introduce a new task setting, unsupervised video panoptic segmentation (VPS): detecting, segmenting, and tracking all objects in a video while partitioning it into semantically consistent regions, without any human-labeled data. Their method, VideoCUPS, uses unsupervised depth, motion, and visual cues from scene-centric videos to generate temporally consistent panoptic pseudo-labels automatically. A model trained on these pseudo-labels with a new loss function, Video DropLoss, outperforms all four baselines, which the authors build by extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. They also provide an evaluation protocol for the task to support future research.
video panoptic segmentation, unsupervised learning, pseudo-labels, depth estimation, motion cues, semantic segmentation, instance segmentation, video tracking, loss function
Authors
Christoph Reich, Oliver Hahn, Nikita Araslanov, Laura Leal-Taixé, Christian Rupprecht, Daniel Cremers, Stefan Roth
Abstract
Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding work has mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.