From Static to Dynamic: Exploring Self-supervised Image-to-Video Representation Transfer Learning

2026-03-27
Computer Vision and Pattern Recognition

AI summary

The authors study how to better use image-trained models for understanding videos. They find a trade-off between keeping objects consistent within the same video and distinguishing objects across different videos when fine-tuning. To address this, they introduce Co-Settle, a method that adds a simple layer on top of a frozen image model and balances these two needs using dedicated training objectives. Their approach improves performance on video tasks after only a few epochs of training while leaving the image model frozen. They also provide theory supporting why their method works well.

video representation learning, image-pretrained models, temporal consistency, semantic separability, fine-tuning, self-supervised training, projection layer, transfer learning, temporal cycle consistency
Authors
Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Xilin Zhao, Qingming Huang
Abstract
Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos, while reducing the tunable parameters hinders intra-video temporal consistency, i.e., the stability of representations of the same object within a video. This dilemma indicates a potential trade-off between intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust the representation space with a temporal cycle consistency objective and a semantic separability constraint. We further provide theoretical support showing that the optimized projection yields a better trade-off between the two properties under appropriate conditions. Experiments on eight image-pretrained models demonstrate consistent improvements across multiple levels of video tasks with only five epochs of self-supervised training. The code is available at https://github.com/yafeng19/Co-Settle.
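The abstract describes three components: a lightweight tunable projection over a frozen image encoder, a temporal cycle consistency objective, and a semantic separability constraint. A minimal sketch of how such components might look is below; the layer sizes, the soft nearest-neighbor formulation of cycle consistency, the InfoNCE-style separability penalty, and the temperature values are all our assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Lightweight tunable projection on top of a frozen image encoder.

    Only this head would be trained; the image encoder stays frozen.
    """

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarities are well-behaved.
        return F.normalize(self.net(x), dim=-1)


def cycle_consistency_loss(feats_a: torch.Tensor, feats_b: torch.Tensor,
                           tau: float = 0.1) -> torch.Tensor:
    """Soft nearest-neighbor cycle between two sets of frame features.

    Frame i in set A is softly matched to set B, and matching back to A
    should land on frame i again (a common form of cycle consistency).
    feats_a: (Ta, D), feats_b: (Tb, D), both L2-normalized.
    """
    sim_ab = feats_a @ feats_b.T / tau            # (Ta, Tb)
    soft_nn = F.softmax(sim_ab, dim=1) @ feats_b  # soft match in B, (Ta, D)
    sim_back = soft_nn @ feats_a.T / tau          # (Ta, Ta) cycle-back logits
    target = torch.arange(feats_a.size(0), device=feats_a.device)
    return F.cross_entropy(sim_back, target)


def separability_loss(video_feats: torch.Tensor,
                      tau: float = 0.1) -> torch.Tensor:
    """Penalize similarity between clip-level features of different videos.

    video_feats: (N, D), one pooled, L2-normalized feature per video.
    High similarity to other videos is discouraged, keeping inter-video
    semantics separable after projection.
    """
    n = video_feats.size(0)
    sim = video_feats @ video_feats.T / tau
    mask = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(mask, float("-inf"))  # a video is not its own negative
    return torch.logsumexp(sim, dim=1).mean()
```

In training, the two losses would be combined with a weighting that realizes the consistency-separability trade-off, e.g. `loss = cycle_consistency_loss(za, zb) + lam * separability_loss(pooled)` for some hyperparameter `lam`, with gradients flowing only through the projection head.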