OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation

2026-04-13 · Computer Vision and Pattern Recognition

Keywords: Human-Object Interaction, Video Generation, Multimodal Conditioning, Pose Estimation, Audio-Visual Synchronization, Model Training Strategies, Benchmark Dataset, Content Creation, Deep Learning, Attention Mechanisms
Authors
Donghao Zhou, Guisheng Liu, Hao Yang, Jiatong Li, Jingyu Lin, Xiaohu Huang, Yichen Liu, Xin Gao, Cunjian Chen, Shilei Wen, Chi-Wing Fu, Pheng-Ann Heng
Abstract
In this work, we study Human-Object Interaction Video Generation (HOIVG), which aims to synthesize high-quality human-object interaction videos conditioned on text, reference images, audio, and pose. This task holds significant practical value for automating content creation in real-world applications, such as e-commerce demonstrations, short video production, and interactive entertainment. However, existing approaches fail to accommodate all these requisite conditions. We present OmniShow, an end-to-end framework tailored for this practical yet challenging task, capable of harmonizing multimodal conditions and delivering industry-grade performance. To overcome the trade-off between controllability and quality, we introduce Unified Channel-wise Conditioning for efficient image and pose injection, and Gated Local-Context Attention to ensure precise audio-visual synchronization. To effectively address data scarcity, we develop a Decoupled-Then-Joint Training strategy that leverages a multi-stage training process with model merging to efficiently harness heterogeneous sub-task datasets. Furthermore, to fill the evaluation gap in this field, we establish HOIVG-Bench, a dedicated and comprehensive benchmark for HOIVG. Extensive experiments demonstrate that OmniShow achieves overall state-of-the-art performance across various multimodal conditioning settings, setting a solid standard for the emerging HOIVG task.
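The Decoupled-Then-Joint Training strategy trains on heterogeneous sub-task datasets in separate stages and then combines the resulting checkpoints via model merging. The abstract does not specify the merging rule; the sketch below assumes the simplest common choice, weighted averaging of model parameters in weight space, and the function name `merge_state_dicts` is illustrative rather than taken from the paper.

```python
def merge_state_dicts(state_dicts, weights=None):
    """Merge several model checkpoints by weighted parameter averaging.

    Hypothetical illustration: each element of `state_dicts` maps
    parameter names to values (floats, NumPy arrays, or torch tensors
    all work, since only `*` and `+` are used). With `weights=None`,
    every checkpoint contributes equally.
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    keys = state_dicts[0].keys()
    # Weighted sum of each parameter across all checkpoints.
    return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts))
            for k in keys}
```

In practice such merging only makes sense when the specialist checkpoints are fine-tuned from a shared initialization, so their parameters live in a compatible region of weight space.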