A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations

2026-02-26

Computation and Language

AI summary

The authors present a system called MiSTER-E that recognizes emotions in conversations using both speech and text. Their method models conversational context for each modality separately, then combines the resulting information through a learned weighting mechanism to decide the emotion. Additional training objectives encourage the speech and text components to agree with each other, and the system never needs to know who is speaking. When tested on three emotion benchmark datasets, the system outperformed comparable models, and ablation studies show which parts of the design contribute to its performance.

Emotion Recognition in Conversations (ERC), Mixture-of-Experts (MoE), Multimodal Fusion, Large Language Models (LLMs), Utterance-level Embedding, Convolutional-Recurrent Networks, Supervised Contrastive Loss, KL-Divergence Regularization, Weighted F1-Score, Cross-Modal Learning
Authors
Soumya Dutta, Smruthi Balaji, Sriram Ganapathy
Abstract
Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposed model achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
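To make the expert-combination idea concrete, the following is a minimal NumPy sketch of a gated mixture of three expert predictions plus a KL-divergence consistency term, as described in the abstract. All function names, shapes, and the toy inputs are hypothetical illustrations, not the paper's actual implementation; in MiSTER-E the gate weights and expert logits would come from learned networks.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_moe(speech_logits, text_logits, cross_logits, gate_logits):
    """Fuse three experts' class logits with learned gate weights.

    gate_logits: (batch, 3) scores from a gating network (not shown here).
    Returns a (batch, classes) mixture distribution: each expert's softmax
    output is weighted by that expert's gate probability and summed.
    """
    probs = np.stack([softmax(speech_logits),
                      softmax(text_logits),
                      softmax(cross_logits)], axis=1)  # (batch, 3, classes)
    gates = softmax(gate_logits)                        # (batch, 3), sums to 1
    return (gates[..., None] * probs).sum(axis=1)       # (batch, classes)

def kl_consistency(p, q, eps=1e-8):
    """Batch-averaged KL(p || q) between two experts' class distributions.

    A regularizer like this penalizes experts whose predictions diverge,
    encouraging cross-modal agreement.
    """
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

# Toy example: batch of 2 utterances, 4 emotion classes.
rng = np.random.default_rng(0)
s, t, c = (rng.normal(size=(2, 4)) for _ in range(3))  # per-expert logits
g = rng.normal(size=(2, 3))                            # gate logits

fused = gated_moe(s, t, c, g)
print(fused.shape)                           # (2, 4)
print(np.allclose(fused.sum(axis=1), 1.0))   # True: rows are valid distributions
print(kl_consistency(softmax(s), softmax(t)) >= 0.0)  # True: KL is non-negative
```

Because the gate weights and each expert's class probabilities both sum to one, the fused output is itself a valid probability distribution over emotion classes.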