Geometry-Guided Camera Motion Understanding in VideoLLMs

2026-03-13

Computer Vision and Pattern Recognition · Artificial Intelligence
AI summary

The authors studied how well current video-based AI models understand camera movements, which shape how we perceive and make films. They created a new dataset with controlled camera motions and a test set to measure model performance. They found that existing models often miss details about camera movement because these signals are only weakly represented inside the models. To fix this, they designed a simple method that uses 3D models to detect camera motion and then helps AI models interpret these cues without retraining. Their approach improves motion recognition and makes AI responses more aware of camera movements.

camera motion · vision-language models · VideoLLM · multi-label recognition · vision encoder · ViT blocks · 3D foundation models · structured prompting · motion recognition · benchmarking
Authors
Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su
Abstract
Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark, $\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward camera-aware VideoLLM and VLA systems. The dataset and benchmark are publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.
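The injection step described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the primitive labels, function names, and prompt template are assumptions made for clarity. The predicted motion primitives from the temporal classifier are serialized into a structured prefix that is prepended to the question before VideoLLM inference.

```python
# Hypothetical sketch of training-free cue injection via structured prompting.
# The primitive vocabulary and template below are illustrative, not from the paper.

def build_camera_aware_prompt(primitives, question):
    """Serialize predicted camera-motion primitives into a structured
    prompt prefix for a downstream VideoLLM query."""
    if primitives:
        hint = "Detected camera motion: " + ", ".join(primitives) + "."
    else:
        hint = "Detected camera motion: static (no significant camera movement)."
    return f"[CAMERA CUES] {hint}\n[QUESTION] {question}"

# Example: classifier predicts a leftward pan combined with a forward dolly.
prompt = build_camera_aware_prompt(
    ["pan-left", "dolly-in"],
    "Describe the camera movement in this clip.",
)
print(prompt)
```

Because the cues enter only through the prompt, the pipeline stays model-agnostic: any instruction-following VideoLLM can consume the same structured prefix without retraining.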