FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control

2026-02-13

Computer Vision and Pattern Recognition · Graphics
AI summary

The authors created FlexAM, a new method for controlling video generation by separating 'appearance' (how things look) from 'motion' (how things move). They represent video dynamics as a 3D point cloud, a control signal that helps the model capture fine-grained motion and depth. This lets a single system handle tasks such as image-to-video editing, video-to-video editing, and camera control with greater precision. Their experiments show FlexAM outperforms previous methods across all of these tasks.

Keywords

video generation, appearance-motion disentanglement, point cloud, 3D control signal, positional encoding, multi-frequency encoding, depth-aware encoding, I2V editing, V2V editing, camera control
Authors
Mingzhi Sheng, Zekai Gu, Peng Li, Cheng Lin, Hao-Xiang Guo, Ying-Cong Chen, Yuan Liu
Abstract
Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that a fundamental disentanglement of "appearance" and "motion" provides a more robust and scalable pathway. We propose FlexAM, a unified framework built upon a novel 3D control signal. This signal represents video dynamics as a point cloud, introducing three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.
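The abstract does not spell out the exact form of the multi-frequency or depth-aware encodings. As a rough illustration only, the sketch below applies a standard multi-frequency (Fourier, NeRF-style) positional encoding to 3D point-cloud coordinates, with depth entering as the third coordinate; the function name multi_frequency_encoding, the octave-spaced frequency schedule, and the band count are assumptions for this example, not the paper's implementation.

```python
import numpy as np

def multi_frequency_encoding(points: np.ndarray, num_bands: int = 6) -> np.ndarray:
    """Fourier-style positional encoding of 3D points (illustrative sketch).

    points: (N, 3) array of xyz coordinates; z can carry depth, so depth
    enters the encoding on equal footing with x and y.
    Returns an (N, 3 * 2 * num_bands) feature array: sin/cos responses at
    octave-spaced frequencies, so points that are close in space still get
    distinguishable codes at the higher-frequency bands.
    """
    freqs = 2.0 ** np.arange(num_bands) * np.pi   # (num_bands,) octave-spaced frequencies
    angles = points[:, :, None] * freqs           # (N, 3, num_bands) per-axis phase
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(points.shape[0], -1)

# Example: two nearby points receive distinct codes in the high-frequency bands.
pts = np.array([[0.10, 0.20, 1.5],
                [0.11, 0.20, 1.5]])
feat = multi_frequency_encoding(pts)
print(feat.shape)  # (2, 36)
```

The intuition this sketch captures is the one the abstract gestures at: low-frequency bands encode coarse position, while high-frequency bands separate fine-grained displacements, which is what allows a point-cloud control signal to distinguish subtle motion.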