CoPE-VideoLM: Codec Primitives For Efficient Video Language Models
2026-02-13 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Artificial Intelligence • Computation and Language
AI summary
The authors observe that current video language models often miss important details because they sample only a few keyframes and run a full image encoder on each of them, which is slow. To address this, they reuse elements of the video compression stream, namely motion vectors and residuals, which already capture motion and redundancy between frames without requiring full-image processing. They build lightweight transformer encoders over these codec primitives and pre-train them to align with existing image features. The resulting method is much faster and uses far fewer tokens, while matching or exceeding standard models across a wide range of video understanding tasks.
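The codec primitives mentioned above already sit in the compressed bitstream and can be read without decoding a full RGB frame at every timestep. As a rough illustration (not the authors' pipeline), the sketch below uses PyAV with ffmpeg's standard "+export_mvs" flag to expose per-block motion vectors as frame side data; the file path is a placeholder, residual extraction is omitted, and the exact side-data accessor may differ across PyAV versions.

```python
# Illustrative only: read motion vectors from a compressed video via PyAV.
# The paper's actual extraction pipeline is not described here; residuals are omitted.
import av

container = av.open("example.mp4")  # placeholder path
stream = container.streams.video[0]
# Ask the decoder to export motion vectors as side data (standard ffmpeg flag).
stream.codec_context.options = {"flags2": "+export_mvs"}

for frame in container.decode(stream):
    # Accessor may vary with PyAV version; intra-coded (key) frames carry no motion vectors.
    mvs = frame.side_data.get("MOTION_VECTORS")
    if mvs is not None:
        vectors = mvs.to_ndarray()  # structured array with per-block motion_x / motion_y
        print(frame.pts, vectors.shape)
```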
Video Language Models • Keyframe Sampling • Motion Vectors • Residuals • Transformer Encoder • Video Compression • Temporal Reasoning • Token Efficiency • Pre-training • Video Understanding Benchmarks
Authors
Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Abstract
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods use keyframe sampling, which can miss both macro-level events and micro-level details due to its sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
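As a hedged reading of the architecture described in the abstract, the PyTorch sketch below shows one plausible form of a lightweight codec-primitive encoder and the alignment objective: learned query tokens attend over per-patch motion-vector/residual features, and a cosine loss pulls the pooled tokens toward frozen image-encoder embeddings of the corresponding keyframes. All class names, dimensions, and the specific loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): a small transformer aggregates
# per-patch codec primitives into a few tokens per frame, and a pre-training loss
# aligns those tokens with frozen image-encoder embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodecPrimitiveEncoder(nn.Module):
    """Aggregates patch-level motion-vector/residual features into a few tokens per frame."""

    def __init__(self, in_dim: int = 6, d_model: int = 256, out_dim: int = 1024,
                 num_layers: int = 2, tokens_per_frame: int = 4):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)              # per-patch MV/residual features
        self.queries = nn.Parameter(torch.randn(tokens_per_frame, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, out_dim)              # match image-encoder width

    def forward(self, primitives: torch.Tensor) -> torch.Tensor:
        # primitives: (B, N_patches, in_dim), e.g. (dx, dy, residual statistics) per patch
        x = self.proj(primitives)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([q, x], dim=1))           # query tokens attend to patches
        return self.head(x[:, : q.size(1)])                  # (B, tokens_per_frame, out_dim)


def alignment_loss(codec_tokens: torch.Tensor, image_embeddings: torch.Tensor) -> torch.Tensor:
    """Cosine-alignment pre-training loss against frozen image features (assumed form)."""
    a = F.normalize(codec_tokens.mean(dim=1), dim=-1)
    b = F.normalize(image_embeddings.mean(dim=1), dim=-1)
    return (1.0 - (a * b).sum(dim=-1)).mean()
```

In this reading, the token savings reported in the abstract would come from emitting only a handful of codec-primitive tokens for each non-keyframe instead of a full grid of image-patch tokens.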