Robust Deepfake Detection: Mitigating Spatial Attention Drift via Calibrated Complementary Ensembles

2026-04-28

Computer Vision and Pattern Recognition
AI summary

The authors found that current deepfake detectors work well on clean data but struggle when images have problems like blur or heavy compression. To fix this, they built a system that trains on heavily degraded images to focus on important, unchanging features like shapes and meanings. Their model uses three special pathways to look at textures, faces, and combined information, which work together to reduce distractions and keep attention focused in the right areas. This approach made their detector more stable and reliable without extra training, ranking fourth in a major challenge.

deepfake detection, spatial attention, compound degradation, DINOv2-Giant, CLIP, Score-CAM, cosine similarity, zero-shot generalization, ensemble learning, NTIRE Challenge
Authors
Minh-Khoa Le-Phan, Minh-Hoang Le, Trong-Le Do, Minh-Triet Tran
Abstract
Current deepfake detection models achieve state-of-the-art performance on pristine academic datasets but suffer severe spatial attention drift under real-world compound degradations, such as blurring and severe lossy compression. To address this vulnerability, we propose a foundation-driven forensic framework that integrates an extreme compound-degradation engine with a structurally constrained, multi-stream architecture. During training, our degradation pipeline systematically destroys high-frequency artifacts, optimizing the DINOv2-Giant backbone to extract invariant geometric and semantic priors. We then process images through three specialized pathways: a Global Texture stream, a Localized Facial stream, and a Hybrid Semantic Fusion stream incorporating CLIP. By analyzing spatial attribution via Score-CAM and feature stability via cosine similarity, we quantitatively demonstrate that these streams extract non-redundant, complementary feature representations and stabilize attention entropy. By aggregating the stream predictions through a calibrated, discretized voting mechanism, our ensemble suppresses background attention drift and acts as a robust geometric anchor. Our approach yields highly stable zero-shot generalization, achieving fourth place in the NTIRE 2026 Robust Deepfake Detection Challenge at CVPR. Code is available at https://github.com/khoalephanminh/ntire26-deepfake-challenge.
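The calibrated, discretized voting described in the abstract can be sketched as follows. The thresholds, weights, and function names below are illustrative assumptions, not the authors' actual calibration; a pairwise cosine-similarity helper is included as a stand-in for the feature-stability analysis (values near zero suggest two streams encode non-redundant information).

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two per-stream feature vectors.
    # Used here as a complementarity proxy: values near 0 suggest
    # the two streams extract non-redundant representations.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def discretize(prob, low=0.35, high=0.65):
    # Map a stream's fake-probability to a discrete vote:
    # -1 (real), 0 (abstain in the uncertain band), +1 (fake).
    # The band edges are hypothetical, not the paper's calibration.
    if prob < low:
        return -1
    if prob > high:
        return 1
    return 0

def ensemble_decision(stream_probs, weights=None):
    # Weighted sum of discretized votes across the three streams;
    # ties and negative scores default to "real".
    weights = weights or [1.0] * len(stream_probs)
    score = sum(w * discretize(p) for w, p in zip(weights, stream_probs))
    return "fake" if score > 0 else "real"
```

For example, per-stream probabilities `[0.92, 0.71, 0.48]` discretize to votes `+1, +1, 0`, so `ensemble_decision` returns `"fake"`; the middle abstention band is what lets an uncertain stream drop out rather than drag the ensemble toward a noisy decision.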