Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

2026-05-07

Computer Vision and Pattern Recognition · Artificial Intelligence · Machine Learning · Multimedia
AI summary

The authors point out that research on Multimodal Domain Generalization (MMDG), which aims to make models work well across different data types and situations, lacks consistent testing standards, making it hard to judge true progress. They created MMDG-Bench, a new benchmark that tests many methods fairly across diverse tasks and conditions, including challenges like missing data and corrupted inputs. Their extensive experiments show that recent MMDG methods only slightly improve over basic approaches, no method is best in all cases, and there is still a big gap to ideal performance. They also find that using three data types together doesn't always beat using two, and most methods struggle when parts of the data are missing or corrupted. Overall, the authors demonstrate that MMDG is still a challenging problem needing better solutions.
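To make the "corrupted inputs" and "missing data" test conditions concrete, below is a minimal illustrative sketch of how such perturbations are commonly simulated at evaluation time. The feature-level Gaussian noise and the 1-5 severity scale are assumptions loosely following common-corruption benchmarks, not MMDG-Bench's actual pipeline:

```python
# Illustrative sketch only; not MMDG-Bench's corruption pipeline.
# Feature-level Gaussian noise and the 1-5 severity scale are assumptions.
import numpy as np

def corrupt(features, severity, rng=None):
    """Simulate a corrupted input modality by adding Gaussian noise
    whose strength grows with the severity level (1 = mild, 5 = severe)."""
    rng = rng or np.random.default_rng(0)
    return features + rng.normal(0.0, 0.1 * severity, size=features.shape)

def drop_modality(features):
    """Simulate a missing modality by zeroing out its features."""
    return np.zeros_like(features)
```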

Keywords
Multimodal Domain Generalization · Benchmark · Action Recognition · Mechanical Fault Diagnosis · Sentiment Analysis · Model Robustness · Corruption Robustness · Missing Modality · Out-of-Distribution Detection · Empirical Risk Minimization (ERM)
Authors
Hao Dong, Hongzhao Li, Shupan Li, Muhammad Haris Khan, Eleni Chatzi, Olga Fink
Abstract
Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field's advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over the ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.
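As an illustration of the cross-domain evaluation protocol the abstract describes, the sketch below runs a leave-one-domain-out loop with a late-fusion ERM baseline on synthetic data and reports test accuracy with all modalities present versus with one modality missing. All names, feature dimensions, and data here are hypothetical assumptions for illustration; this is not the MMDG-Bench implementation:

```python
# Minimal leave-one-domain-out MMDG sketch with a late-fusion ERM baseline.
# Everything here (modality names, dimensions, synthetic data) is an
# illustrative assumption, not MMDG-Bench's code.
import numpy as np

N_DOMAINS, N_CLASSES, N_PER_DOMAIN = 4, 5, 200
DIMS = {"video": 32, "audio": 16}  # hypothetical per-modality feature sizes

def make_domain(seed):
    """Synthetic multimodal domain: class signal plus a per-domain shift."""
    r = np.random.default_rng(seed)
    y = r.integers(0, N_CLASSES, N_PER_DOMAIN)
    x = {m: r.normal(0, 1, (N_PER_DOMAIN, d)) + y[:, None] * 0.5 + seed * 0.3
         for m, d in DIMS.items()}
    return x, y

def fit_erm(x, y):
    """ERM baseline: one-vs-all ridge regression on concatenated
    (late-fused) modality features; the closed form keeps the sketch short."""
    feats = np.hstack([x[m] for m in DIMS])
    targets = np.eye(N_CLASSES)[y]
    lam = 1.0  # ridge strength, arbitrary for this toy setup
    return np.linalg.solve(feats.T @ feats + lam * np.eye(feats.shape[1]),
                           feats.T @ targets)

def accuracy(w, x, y, missing=None):
    """Top-1 accuracy; optionally zero one modality to mimic the
    missing-modality test condition."""
    x = {m: (np.zeros_like(v) if m == missing else v) for m, v in x.items()}
    feats = np.hstack([x[m] for m in DIMS])
    return float(((feats @ w).argmax(axis=1) == y).mean())

domains = [make_domain(s) for s in range(N_DOMAINS)]
for held_out in range(N_DOMAINS):
    train = [d for i, d in enumerate(domains) if i != held_out]
    x_tr = {m: np.vstack([d[0][m] for d in train]) for m in DIMS}
    y_tr = np.concatenate([d[1] for d in train])
    w = fit_erm(x_tr, y_tr)
    x_te, y_te = domains[held_out]
    print(f"held-out domain {held_out}: "
          f"full={accuracy(w, x_te, y_te):.3f}, "
          f"audio missing={accuracy(w, x_te, y_te, missing='audio'):.3f}")
```

The per-domain accuracies are then averaged, which is the standard way leave-one-domain-out results are reported in domain generalization work.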