MAJIC: Leveraging Articulatory Motion for Speech-based Emotion Recognition

2026-06-16 · Human-Computer Interaction

AI summary

The authors developed MAJIC, a system that recognizes emotion in speech by combining audio with the motion of the jaw and facial muscles. Audio-only systems often struggle when emotions are expressed subtly; MAJIC adds articulatory motion as a complementary signal. The authors evaluated it with 20 participants speaking 10 languages in both prompted and conversational settings. On their dataset it outperformed audio-only baselines, reaching 93% accuracy and a 91% F1 score.

speech emotion recognition, articulatory motion, jaw movement, facial muscles, multi-task learning, pitch, prosody, multimodal systems, emotion classification, F1 score
Authors
Tanmay Srivastava, Paras Bhavnani, Benjir Alvee Islam, Shubham Jain
Abstract
We introduce MAJIC, a multimodal emotion recognition system that leverages articulatory motion of the jaw and facial muscles for speech-based emotion recognition (SER). While most SER systems perform well on datasets with strongly expressed emotional speech of trained actors, their performance often degrades when emotional expressions become more subtle. We explore this challenge by engineering features from articulatory motion and integrating them with audio features using a multi-task learning framework. Our key insight is that emotion in speech manifests not only through vocal characteristics but also through distinct articulatory motions: jaw movements, facial muscle vibrations, and speech-induced vibrations. While audio captures features such as pitch and prosody, articulatory motion contains complementary information that is not present in audio alone. We evaluate our system on data collected from 20 participants across multiple sessions, 10 languages, and diverse scenarios, including prompted and conversational speech, showing its robustness across users and settings. MAJIC achieves 93% accuracy and 91% F1 score for emotion classification, outperforming strong audio-based baselines on our dataset.
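
The abstract reports accuracy and F1 score as separate numbers, and the two can diverge when emotion classes are unevenly represented or unevenly recognized. The sketch below illustrates this on made-up labels. It assumes macro-averaged F1 (the abstract does not state the averaging scheme), and the emotion names and predictions are hypothetical, not taken from the MAJIC dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: the single "sad" sample is misclassified as "neutral".
y_true = ["happy", "happy", "happy", "sad", "neutral", "neutral"]
y_pred = ["happy", "happy", "happy", "neutral", "neutral", "neutral"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case five of six predictions are correct, yet macro F1 is much lower because the missed class contributes an F1 of zero. Reporting both metrics, as the abstract does, guards against a classifier that does well only on the frequent emotions.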