MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

2026-05-08

Computer Vision and Pattern Recognition
AI summary

The authors present MoCoTalk, a system that generates talking-head videos from four types of input: a reference image, facial keypoints, 3D face shading meshes, and speech audio. They introduce an adaptive router that mixes these inputs with different weights depending on the stage of video generation, avoiding conflicts between them. They also design a 3D-based representation that controls mouth and facial movements separately from head motion, improving how speech-related expressions are captured. Their experiments show that MoCoTalk outperforms previous methods while allowing fine-grained control over facial attributes.

talking-head generation, facial keypoints, 3D morphable models (3DMM), video diffusion models, multi-condition fusion, audio-visual alignment, facial expression modeling, motion control, speech-driven animation, lip synchronization
Authors
Xinyan Ye, Jiankang Deng, Abbas Edalat
Abstract
Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.
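To make the routing idea concrete, the minimal PyTorch sketch below shows one plausible form of channel-wise, timestep-aware gating over four condition streams of equal channel width. The module name, the pooled feature summary, and the softmax normalization across conditions are illustrative assumptions; the abstract does not specify the paper's exact router architecture.

```python
# Hypothetical sketch of a timestep-aware, channel-wise condition router.
# Assumes four condition feature maps already projected to the same shape (B, C, H, W).
import torch
import torch.nn as nn


class AdaptiveConditionRouter(nn.Module):
    """Fuses four condition streams with per-channel gates conditioned on the
    diffusion timestep embedding (all names and shapes are assumptions)."""

    def __init__(self, channels: int, t_embed_dim: int, num_conditions: int = 4):
        super().__init__()
        self.num_conditions = num_conditions
        # Predict one gate per condition per channel from the timestep embedding
        # plus a pooled summary of all condition features.
        self.gate_mlp = nn.Sequential(
            nn.Linear(t_embed_dim + num_conditions * channels, num_conditions * channels),
            nn.SiLU(),
            nn.Linear(num_conditions * channels, num_conditions * channels),
        )

    def forward(self, conds: list[torch.Tensor], t_embed: torch.Tensor) -> torch.Tensor:
        # conds: list of 4 tensors, each (B, C, H, W); t_embed: (B, t_embed_dim)
        b, c = conds[0].shape[:2]
        pooled = torch.cat([f.mean(dim=(2, 3)) for f in conds], dim=1)   # (B, 4C)
        gates = self.gate_mlp(torch.cat([t_embed, pooled], dim=1))       # (B, 4C)
        gates = gates.view(b, self.num_conditions, c).softmax(dim=1)     # normalize across conditions
        stacked = torch.stack(conds, dim=1)                              # (B, 4, C, H, W)
        return (gates.unsqueeze(-1).unsqueeze(-1) * stacked).sum(dim=1)  # (B, C, H, W)


# Usage sketch: four condition feature maps fused at one denoising step.
router = AdaptiveConditionRouter(channels=320, t_embed_dim=1280)
fused = router([torch.randn(2, 320, 32, 32) for _ in range(4)], torch.randn(2, 1280))
```

Because the gates depend on the timestep embedding, the learned fusion can, for example, weight the shading-mesh stream more heavily at high noise levels and the audio stream more heavily near the end of denoising, which is the qualitative behavior the abstract attributes to the Adaptive Multi-Condition Router.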