AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
2026-04-09 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Artificial Intelligence • Computation and Language
AI summary
The authors introduce AVGen-Bench, a benchmark for judging how well models generate matching video and sound from a text prompt. Existing benchmarks mostly evaluate audio and video separately or rely on coarse embedding-similarity scores. AVGen-Bench instead combines lightweight specialist models with multimodal large language models to assess both perceptual quality and how faithfully the output follows the prompt. The evaluation shows that current systems produce output that looks and sounds good but often fails on details such as text rendering, speech coherence, physical reasoning, and musical pitch control.
Text-to-Audio-Video generation • Benchmarking • Multimodal Large Language Models • Semantic controllability • Perceptual quality • Audio-visual aesthetics • Speech coherence • Physical reasoning • Musical pitch control
Authors
Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing, Yuqing Yang, Qi Dai, Lili Qiu, Chong Luo
Abstract
Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.