TETO: Tracking Events with Teacher Observation for Motion Estimation and Frame Interpolation

2026-03-24

Computer Vision and Pattern Recognition
AI summary

The authors propose TETO, a method that lets event cameras learn motion estimation from a small amount of real-world video instead of large-scale synthetic data. It selects informative training examples through motion-aware data curation and distills knowledge from a pretrained RGB video tracker. The resulting system jointly predicts point trajectories and dense optical flow, which in turn serve as motion priors to improve video frame interpolation. The approach achieves strong results on several benchmarks while using far less training data.

event cameras, motion estimation, knowledge distillation, RGB tracker, optical flow, frame interpolation, ego-motion, video diffusion transformer, synthetic data, teacher-student framework
Authors
Jini Yang, Eunbeen Hong, Soowon Son, Hyunkoo Lee, Sunghwan Hong, Sunok Kim, Seungryong Kim
Abstract
Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only $\sim$25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.
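The teacher-student distillation described above can be sketched in a few lines: a frozen, pretrained RGB tracker provides pseudo-label point trajectories, and an event-based student is trained to reproduce them from event voxel grids. This is a minimal illustrative sketch, not the authors' implementation; the module names, tensor shapes, and L1 objective are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class EventStudent(nn.Module):
    """Toy student: maps an event voxel grid to point trajectories.
    Shapes are illustrative (5 temporal bins, 16x16 spatial resolution)."""
    def __init__(self, num_points=8, num_steps=4):
        super().__init__()
        self.num_points, self.num_steps = num_points, num_steps
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(5 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, num_points * num_steps * 2),  # (x, y) per step
        )

    def forward(self, voxels):
        out = self.net(voxels)
        return out.view(-1, self.num_points, self.num_steps, 2)

def distill_step(student, optimizer, voxels, teacher_trajs):
    """One distillation update: regress the student's predicted trajectories
    onto pseudo-labels produced by a frozen RGB teacher (hypothetical loss)."""
    pred = student(voxels)
    loss = nn.functional.l1_loss(pred, teacher_trajs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

student = EventStudent()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
voxels = torch.randn(2, 5, 16, 16)   # event voxel grid (batch, bins, H, W)
teacher = torch.randn(2, 8, 4, 2)    # stand-in for RGB-tracker pseudo-labels
loss = distill_step(student, opt, voxels, teacher)
```

In the paper's setting the pseudo-labels would come from running the pretrained RGB tracker on frames aligned with the event stream, and the query points would be chosen by the motion-aware sampling strategy rather than at random.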