Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning
2026-02-23 • Computer Vision and Pattern Recognition
AI summary
The authors introduce Flow3r, a new method to help computers understand 3D and 4D scenes using only regular video, without needing expensive 3D data. They teach the model to predict movement between video frames by splitting the task into parts involving scene shape and camera movement. This approach helps the system learn better about how scenes change, especially for dynamic, real-world videos where labeled data is rare. Their tests show that Flow3r performs very well across many different scenarios by using lots of unlabeled video data.
3D reconstruction, 4D reconstruction, monocular video, dense 2D correspondences, optical flow, pose estimation, scene geometry, dynamic scenes, self-supervised learning, visual geometry learning
Authors
Zhongxiao Cong, Qitao Zhao, Minsik Jeon, Shubham Tulsiani
Abstract
Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision -- expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ('flow') as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ~800K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
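To make the factorization concrete, the sketch below illustrates one way flow between two images can be derived from geometry decoded from one image and a camera pose decoded from the other: per-pixel 3D points from image i are reprojected through the camera of image j, and the displacement of the projected pixels is the dense flow. All function names, shapes, and the toy setup are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of factored flow: flow i -> j is computed from scene
# geometry decoded from image i (per-pixel 3D points in world coordinates)
# and the camera pose/intrinsics decoded from image j.
import numpy as np

def factored_flow(points_i, pixels_i, R_j, t_j, K_j):
    """Dense 2D flow from image-i geometry and image-j pose.

    points_i: (N, 3) world-space 3D points, one per pixel of image i
    pixels_i: (N, 2) pixel coordinates in image i
    R_j, t_j: rotation (3, 3) and translation (3,) of camera j (world->camera)
    K_j:      (3, 3) intrinsics of camera j
    """
    cam = points_i @ R_j.T + t_j           # world -> camera-j coordinates
    proj = cam @ K_j.T                     # camera -> homogeneous pixel coords
    pixels_j = proj[:, :2] / proj[:, 2:3]  # perspective divide
    return pixels_j - pixels_i             # dense 2D flow i -> j

# Toy example: a fronto-parallel plane at depth 2, and a second camera
# translated horizontally in world space -> uniform horizontal flow.
H, W, depth = 4, 4, 2.0
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
pixels = np.stack([xs, ys], -1).reshape(-1, 2).astype(float)
K = np.eye(3)
rays = np.concatenate([pixels, np.ones((H * W, 1))], 1) @ np.linalg.inv(K).T
points = rays * depth                      # geometry "decoded" from image i
R = np.eye(3)
t = np.array([-0.5, 0.0, 0.0])             # camera j shifted in world x
flow = factored_flow(points, pixels, R, t, K)
```

Because geometry comes only from image i and pose only from image j, a flow loss against 2D correspondences sends a clean gradient to each factor separately, which is the property the factorization is designed to exploit.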