MSCT: Differential Cross-Modal Attention for Deepfake Detection
2026-04-09 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Multimedia
AI summary
The authors address the problem of detecting fake videos by looking at both audio and visual clues. They point out that previous methods do not extract enough of the important features and sometimes misalign audio with video. To fix this, they created a new model called the multi-scale cross-modal transformer (MSCT) that better combines audio and visual information at different scales. Their tests on the FakeAVCeleb dataset show competitive performance.
deepfake detection • audio-visual alignment • multi-modal model • transformer encoder • self-attention • cross-modal attention • feature extraction • FakeAVCeleb dataset
Authors
Fangda Wei, Miao Liu, Yingxue Wang, Jing Wang, Shenghui Zhao, Nan Li
Abstract
Audio-visual deepfake detection typically employs a complementary multi-modal model to check for forgery traces in a video. These methods primarily extract forgery traces through audio-visual alignment, since such traces arise from inconsistencies between the audio and video modalities. However, traditional multi-modal forgery detection methods suffer from insufficient feature extraction and modal alignment deviation. To address this, we propose a multi-scale cross-modal transformer encoder (MSCT) for deepfake detection. Our approach combines multi-scale self-attention, which integrates the features of adjacent embeddings, with differential cross-modal attention, which fuses multi-modal features. Experiments on the FakeAVCeleb dataset demonstrate competitive performance, validating the effectiveness of the proposed structure.
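The abstract does not spell out how the differential cross-modal attention is computed, so the following is only an illustrative sketch and not the authors' implementation. It assumes one plausible reading: audio embeddings act as queries over video embeddings, and the fused output uses the difference of two softmax attention maps. The function names, the weight matrices, the scaling factor `lam`, and the choice of audio as the query modality are all assumptions introduced for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention_map(queries, keys):
    """Scaled dot-product attention weights, shape (n_queries, n_keys)."""
    scale = np.sqrt(queries.shape[-1])
    return softmax(queries @ keys.T / scale, axis=-1)


def differential_cross_attention(audio, video, w_q1, w_k1, w_q2, w_k2, w_v, lam=0.5):
    """Hypothetical differential cross-modal attention (assumed form).

    Queries come from the audio embeddings; keys and values come from the
    video embeddings. Two attention maps are computed with separate
    projections and subtracted, so attention mass that both maps share is
    partly cancelled.
    """
    map_1 = attention_map(audio @ w_q1, video @ w_k1)
    map_2 = attention_map(audio @ w_q2, video @ w_k2)
    diff_map = map_1 - lam * map_2      # assumption: lam is a fixed scalar
    fused = diff_map @ (video @ w_v)    # audio positions attend to video values
    return fused, diff_map


rng = np.random.default_rng(0)
d_model, d_head = 16, 8
audio = rng.normal(size=(5, d_model))   # 5 audio embeddings (toy data)
video = rng.normal(size=(7, d_model))   # 7 video embeddings (toy data)
weights = [rng.normal(size=(d_model, d_head)) for _ in range(5)]

fused, diff_map = differential_cross_attention(audio, video, *weights, lam=0.5)
print(fused.shape, diff_map.shape)  # (5, 8) (5, 7)
```

Because each softmax map has rows that sum to 1, every row of the differential map in this sketch sums to `1 - lam`. The two sequences may have different lengths, which is why the map is rectangular.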