MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation
2026-04-21 • Computer Vision and Pattern Recognition
AI summary
The authors developed MMControl, a system that lets users control both the video and the audio when generating content with a single model. Unlike previous methods that control only video, MMControl injects both visual and acoustic conditions, such as reference images, reference audio clips, depth maps, and body poses, into one joint generation process. This helps the model produce videos with consistent character identity and matching voices, while letting users adjust how strongly each condition influences the final output. Their tests show that this approach gives fine-grained control over character identity, voice timbre, movements, and scene layout.
Diffusion Transformer, audio-video generation, multi-modal control, conditional injection, reference images, reference audio, depth maps, pose sequences, guidance scaling, cross-modal alignment
Authors
Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
Abstract
Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control, which limits comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism that incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into the joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
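To make the dual-stream injection concrete, here is a minimal PyTorch sketch of how bypass-branch conditioning might attach to a joint audio-video DiT block, assuming a ControlNet-style design with zero-initialized projections. All module names, the two-stream block interface, and the encoder shapes are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: bypass-branch conditional injection into a joint
# audio-video DiT. Each condition (reference image, depth, pose, reference
# audio) gets its own lightweight branch whose output is added to the
# corresponding token stream. Names and interfaces are assumptions.
import torch
import torch.nn as nn

class BypassBranch(nn.Module):
    """Encodes one control signal and injects it additively into a stream."""
    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(cond_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Zero-initialized projection so training starts from the unmodified
        # base model (a common ControlNet-style choice).
        self.zero_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, cond_tokens: torch.Tensor) -> torch.Tensor:
        return self.zero_proj(self.encoder(cond_tokens))

class ControlledJointBlock(nn.Module):
    """Wraps an assumed two-stream DiT block; each stream receives its own
    set of bypass features before the block runs."""
    def __init__(self, base_block: nn.Module, hidden_dim: int,
                 video_cond_dims, audio_cond_dims):
        super().__init__()
        self.base_block = base_block
        self.video_branches = nn.ModuleList(
            BypassBranch(d, hidden_dim) for d in video_cond_dims)
        self.audio_branches = nn.ModuleList(
            BypassBranch(d, hidden_dim) for d in audio_cond_dims)

    def forward(self, video_tokens, audio_tokens, video_conds, audio_conds):
        # Visual conditions (reference image, depth, pose) go to the video
        # stream; acoustic conditions (reference audio) go to the audio stream.
        for branch, cond in zip(self.video_branches, video_conds):
            video_tokens = video_tokens + branch(cond)
        for branch, cond in zip(self.audio_branches, audio_conds):
            audio_tokens = audio_tokens + branch(cond)
        return self.base_block(video_tokens, audio_tokens)
```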
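Modality-specific guidance scaling can likewise be sketched by analogy with multi-condition classifier-free guidance: the denoiser is run once without conditions and once per condition, and each condition's contribution is weighted by its own user-chosen scale. The function signature, condition names, and scale values below are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of per-modality guidance at inference, assuming a denoiser
# callable as model(x_t, t, cond=...). Each condition's influence is scaled
# independently, and the scales can even vary across denoising steps.
import torch

@torch.no_grad()
def guided_denoise(model, x_t, t, conditions: dict, scales: dict) -> torch.Tensor:
    """conditions/scales map names like 'ref_image', 'ref_audio', 'depth',
    'pose' to condition tensors and their guidance weights."""
    eps_uncond = model(x_t, t, cond=None)
    eps = eps_uncond.clone()
    for name, cond in conditions.items():
        eps_cond = model(x_t, t, cond={name: cond})
        # A larger scale tightens adherence to this condition alone.
        eps = eps + scales[name] * (eps_cond - eps_uncond)
    return eps

# Example: strong identity control, weaker pose control.
# scales = {"ref_image": 3.0, "ref_audio": 2.0, "depth": 1.0, "pose": 0.5}
```

Note that this formulation costs one extra forward pass per active condition, which is the usual trade-off for independently tunable multi-condition guidance.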