Seeing Fast and Slow: Learning the Flow of Time in Videos

2026-04-23 · Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition · Artificial Intelligence · Graphics
AI summary

The authors explored how to understand and change the speed of videos, like figuring out if a video is sped up or slowed down. They developed computer models that learn to detect these speed changes without needing extra labels and used these models to create a large collection of slow-motion videos from everyday sources. With this data, they built tools that can generate videos at different speeds and improve low-quality videos by adding smooth, detailed motion. Their work treats time as something that can be learned and controlled in videos, which could help in creating better video editing tools and detecting video speed manipulations.

Keywords
video speed detection, self-supervised learning, temporal structure, slow-motion video, speed-conditioned video generation, temporal super-resolution, high frame rate, motion detail, video temporal control
Authors
Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu, Ali Farhadi, Matthew Wallingford, Yu-Chiang Frank Wang, Steve Marschner, Wei-Chiu Ma
Abstract
How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at a specified playback speed, and temporal super-resolution, which transforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal detail. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world models that understand how events unfold over time.
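The abstract does not specify the exact pretext task, but self-supervised playback-speed estimation is commonly set up by sampling the same clip at different temporal strides and training a model to recover the stride. The toy sketch below illustrates that idea only; the 1-D "clip", the `motion_magnitude` feature, and all function names are hypothetical stand-ins for real frames and a learned encoder, not the authors' method.

```python
# Toy sketch of a speed-prediction pretext task (hypothetical, not the paper's model):
# playing a clip at stride s amplifies frame-to-frame motion, so motion magnitude
# relative to a 1x baseline reveals the playback speed.

def subsample(frames, stride):
    """Simulate playing a clip at `stride`x speed by keeping every stride-th frame."""
    return frames[::stride]

def motion_magnitude(frames):
    """Mean absolute frame-to-frame difference: larger stride -> larger motion."""
    diffs = [abs(b - a) for a, b in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs)

def predict_stride(frames, base_motion):
    """Estimate playback speed by comparing motion magnitude to the 1x baseline."""
    return round(motion_magnitude(frames) / base_motion)

# Toy clip: an object moving at constant velocity (frame value = position).
clip = list(range(64))
base = motion_magnitude(subsample(clip, 1))  # motion magnitude at 1x speed

for stride in (1, 2, 4):
    sped_up = subsample(clip, stride)
    print(stride, predict_stride(sped_up, base))  # recovered stride matches
```

In a real self-supervised setup, the hand-crafted motion feature would be replaced by a video encoder trained with a classification loss over the sampled strides, so the supervision signal comes for free from the temporal structure of unlabeled video.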