DeepFense: A Unified, Modular, and Extensible Framework for Robust Deepfake Audio Detection
2026-04-09 • Sound
AI summary
The authors created DeepFense, a free toolkit that helps researchers detect fake speech using advanced techniques in one place. They tested over 400 different models and found that the type of pre-trained audio features used has the biggest effect on how well the models work. They also discovered that many top models are biased by factors like audio quality, the speaker's gender, and language. DeepFense aims to help improve fairness and performance when deploying speech deepfake detectors in real-world situations.
Speech deepfake detection • PyTorch • Pre-trained feature extractor • Cross-domain generalization • Bias in machine learning • Audio augmentation • Model evaluation • Training data • Deep learning • Speech synthesis
Authors
Yassine El Kheir, Arnab Das, Yixuan Xiao, Xin Wang, Feidi Kallel, Enes Erdem Erdogan, Ngoc Thang Vu, Tim Polzehl, Sebastian Moeller
Abstract
Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment by providing the tools needed for equitable training data selection and front-end fine-tuning.
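To make the bias finding concrete, the sketch below shows one way a per-group evaluation could be carried out. It is illustrative only and is not DeepFense's API: the abstract does not name a metric, so the equal error rate (EER), a common choice in speech deepfake detection, is assumed here, and the function names `equal_error_rate` and `eer_by_group`, the score convention (higher means more likely bona fide), and the toy data are all hypothetical.

```python
from collections import defaultdict


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumed convention: label 1 = bona fide, label 0 = spoof,
    and a higher score means "more likely bona fide".
    """
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    if not bona or not spoof:
        raise ValueError("need at least one bona fide and one spoofed sample")

    # Candidate thresholds: every observed score, plus one above the maximum.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    best_gap, eer = None, None
    for t in thresholds:
        far = sum(s >= t for s in spoof) / len(spoof)  # spoof accepted
        frr = sum(s < t for s in bona) / len(bona)     # bona fide rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def eer_by_group(scores, labels, groups):
    """Compute the EER separately for each group label (e.g. language)."""
    buckets = defaultdict(lambda: ([], []))
    for s, y, g in zip(scores, labels, groups):
        buckets[g][0].append(s)
        buckets[g][1].append(y)
    return {g: equal_error_rate(sc, lb) for g, (sc, lb) in buckets.items()}


# Toy data (hypothetical): group "A" is cleanly separated, group "B" is not.
scores = [0.9, 0.8, 0.2, 0.1, 0.9, 0.3, 0.6, 0.1]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
result = eer_by_group(scores, labels, groups)
print(result)
```

A large gap between groups on the same detector, as between "A" and "B" in this toy data, is the kind of disparity that a pooled score over all samples would hide.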