Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
2026-04-15 • Computer Vision and Pattern Recognition
AI summary
The authors study a failure mode of audio-visual language models in which the model guesses sounds from the video alone, ignoring the actual audio. To fix this, they propose Audio-Contrastive Preference Optimization (ACPO), which trains the model to attend to the true audio rather than rely on visual cues. The approach uses two contrasting objectives to reduce these audio hallucinations without hurting the model's overall ability to understand sound and video together. Experiments show that the method helps the model better match its audio descriptions to what actually occurs in the video.
Audio-Visual Language Models · Cross-Modal Hallucination · Audio Hallucination · Contrastive Learning · Multimodal Learning · Audio Grounding · Preference Optimization · Visual Dominance
Authors
Ami Baid, Zihui Xue, Kristen Grauman
Abstract
While Audio-Visual Language Models (AVLMs) have achieved remarkable progress in recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding the true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO), a dual-axis preference learning framework that introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes faithful audio grounding and mitigates audio hallucination without compromising overall multimodal capabilities.
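The abstract describes the two objectives only at a high level. As a rough illustration, both axes can be read as DPO-style preference terms over paired sequence log-likelihoods: the output axis contrasts two responses under one input, while the input axis contrasts two audio tracks under one response. The sketch below is an assumption about one plausible formulation, not the authors' released implementation; every name in it (acpo_loss, the six log-likelihood arguments, beta, lam) is hypothetical.

```python
import torch
import torch.nn.functional as F

def acpo_loss(pol_faith_true, pol_hall_true, pol_faith_swap,
              ref_faith_true, ref_hall_true, ref_faith_swap,
              beta=0.1, lam=1.0):
    """Illustrative dual-axis preference loss (all names assumed, not from the paper).

    Each argument is a batch of sequence log-likelihoods:
      *_faith_true: audio-faithful response given (video, true audio)
      *_hall_true:  visually-driven hallucination given (video, true audio)
      *_faith_swap: audio-faithful response given (video, swapped audio)
    'pol' = trainable policy model, 'ref' = frozen reference model.
    """
    # Output-contrastive axis: same input, prefer the audio-faithful
    # response over the visual description masquerading as an audio fact.
    out_axis = -F.logsigmoid(beta * ((pol_faith_true - ref_faith_true)
                                     - (pol_hall_true - ref_hall_true)))
    # Input-contrastive axis: same response, prefer conditioning on the
    # true audio track over a swapped one, so that scoring invariant to
    # the audio signal is penalized.
    in_axis = -F.logsigmoid(beta * ((pol_faith_true - ref_faith_true)
                                    - (pol_faith_swap - ref_faith_swap)))
    return (out_axis + lam * in_axis).mean()

# Toy check with random log-likelihoods for a batch of 4 examples.
loss = acpo_loss(*torch.randn(6, 4).unbind(0))
print(loss)  # scalar; lower when faithful / true-audio pairs score higher
```

Under this reading, both axes reduce to the same Bradley-Terry preference form, so a single optimizer pass can train against hallucinated outputs and audio-invariant behavior simultaneously.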