A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition
2026-03-12 • Computer Vision and Pattern Recognition
AI summary
The authors tackle the problem of recognizing facial emotions in difficult, real-world videos where faces may be blurry, at different angles, or moving. They propose a two-stage method: the first stage trains a strong pretrained visual model with a padding-aware augmentation scheme to better prepare face images and a mixture-of-experts head to improve classification. The second stage combines face crops at multiple scales with matching audio cues through a gated fusion module and smooths the predictions over time. Their approach outperforms existing methods on a standard emotion dataset, showing that combining sound and vision thoughtfully helps with tricky emotion recognition in videos.
facial expression recognition, DINOv2, Vision Transformer (ViT), audio-visual fusion, Wav2Vec 2.0, mixture-of-experts, temporal smoothing, emotion classification, Affective Behavior Analysis in-the-Wild (ABAW)
Authors
Jiajun Sun, Zhe Gao
Abstract
This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 ± 0.0277 under 5-fold cross-validation, outperforming the official baselines.
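The Stage II pipeline described in the abstract (gated fusion of frame-level visual and audio features, then inference-time temporal smoothing) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the gate parameters `W` and `b`, the linear classifier stand-in, and the moving-average window size are all hypothetical stand-ins, and the real model presumably operates on learned DINOv2 and Wav2Vec 2.0 embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_fusion(v, a, W, b):
    """Sigmoid gate over concatenated features; convex per-dimension mix.

    v, a: (T, D) frame-aligned visual and audio features.
    W (2D, D) and b (D,) are hypothetical gate parameters.
    """
    z = np.concatenate([v, a], axis=-1) @ W + b
    gate = 1.0 / (1.0 + np.exp(-z))        # sigmoid gate in (0, 1)
    return gate * v + (1.0 - gate) * a     # fused frame-level representation

def temporal_smooth(logits, k=3):
    """Moving-average smoothing of per-frame class logits over k frames."""
    kernel = np.ones(k) / k
    return np.stack(
        [np.convolve(logits[:, c], kernel, mode="same")
         for c in range(logits.shape[1])],
        axis=1,
    )

T, D, C = 8, 16, 8                          # frames, feature dim, 8 expression classes
v = rng.normal(size=(T, D))                 # stand-in for multi-scale-averaged visual features
a = rng.normal(size=(T, D))                 # stand-in for frame-aligned audio features
W = rng.normal(size=(2 * D, D)) * 0.1
b = np.zeros(D)

fused = gated_fusion(v, a, W, b)            # (T, D)
logits = fused @ rng.normal(size=(D, C))    # hypothetical linear classifier head
smoothed = temporal_smooth(logits, k=3)     # (T, C)
pred = smoothed.argmax(axis=1)              # one expression label per frame
```

Because the gate is a sigmoid, each fused dimension lies between the corresponding visual and audio values, so neither modality can be fully discarded; the smoothing step then suppresses single-frame label flips of the kind the abstract attributes to motion blur and temporal instability.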