ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation
2026-05-07 • Computer Vision and Pattern Recognition • Artificial Intelligence • Machine Learning
AI summary
The authors introduce ActCam, a method for generating videos in which both the actor's motion and the camera's viewpoint can be controlled in detail. The approach builds on existing image-to-video diffusion models and adds a mechanism to keep motion and scene depth geometrically consistent across frames. By guiding the generation process in two stages, ActCam improves how well generated videos match target camera trajectories and character motions, especially under large viewpoint changes. The method requires no additional training and outperforms previous control methods in both quantitative tests and human evaluations.
video generation · diffusion models · pose estimation · scene depth · camera trajectory · image-to-video synthesis · motion transfer · denoising process · zero-shot learning
Authors
Omar El Khalifi, Thomas Rossi, Oscar Fossey, Thibault Fouque, Ulysse Mizrahi, Philip Torr, Ivan Laptev, Fabio Pizzati, Baptiste Bellot-Gurlet
Abstract
For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.
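The two-phase conditioning schedule described in the abstract can be sketched as a plain sampling loop: early denoising steps receive both pose and sparse depth conditioning to enforce scene structure, and later steps drop depth so that pose-only guidance refines detail. The sketch below is purely illustrative; all names (`sample_two_phase`, `denoise_step`, `depth_phase_frac`) are hypothetical and do not reflect ActCam's actual implementation or API.

```python
# Hypothetical sketch of a two-phase conditioning schedule for a diffusion
# sampling loop. Not ActCam's code: function names, the conditioning dict
# format, and the phase-switch fraction are all illustrative assumptions.

def sample_two_phase(latents, pose_cond, depth_cond, denoise_step,
                     num_steps=50, depth_phase_frac=0.4):
    """Run num_steps denoising steps, dropping the depth condition after
    the first depth_phase_frac fraction of steps."""
    switch_step = int(num_steps * depth_phase_frac)
    for t in range(num_steps):
        if t < switch_step:
            # Phase 1: pose + sparse depth jointly enforce scene structure.
            cond = {"pose": pose_cond, "depth": depth_cond}
        else:
            # Phase 2: pose-only guidance refines high-frequency details
            # without over-constraining the generation.
            cond = {"pose": pose_cond}
        latents = denoise_step(latents, t, cond)
    return latents
```

The single sampling process is key here: rather than generating twice, the same denoising trajectory simply sees a different conditioning set before and after the switch step.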