Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech

2026-06-05

Sound
AI summary

The authors looked at a new way to detect fake speech made by neural audio codec-based speech generators, which sounds very real and is hard to spot. They noticed that using codec-resynthesized speech as training data helps but doesn't always work well on new, unseen fakes. To fix this, they created a method called Domain-Shift Feature Augmentation (DSFA) that adds random variation to feature statistics during fine-tuning to better mimic real-world differences. They also made a tough new test dataset called CoSG ExtEval, with 40 unseen generative models and long-form audio, to check how well detectors generalize. Their method improved performance on this harder test, helping to better catch fake speech from many different models.

neural audio codec, speech generation, deepfake detection, proxy data, domain adaptation, feature augmentation, self-supervised learning, generalization, evaluation dataset, audio synthesis
Authors
Xuanjun Chen, Yun-Shing Wu, Wei-Chung Lu, Claire Lin, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang
Abstract
Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec-resynthesized speech (CoRS) as proxy data improves performance, it often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval dataset (from CodecFake+), featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks on both CoSG Eval and CoSG ExtEval.
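
The abstract does not give DSFA's exact formulation. As a rough illustration of what "transforming deterministic feature statistics into stochastic distributions" could look like, the sketch below perturbs each utterance's per-channel mean and standard deviation with Gaussian noise whose scale is the spread of that statistic across the batch. The function names, the probability `p`, the Gaussian sampling, and the batch-level uncertainty estimate are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def channel_stats(utt, eps=1e-6):
    """Per-channel mean and std over time for one utterance (a list of frames)."""
    n_frames, n_ch = len(utt), len(utt[0])
    means = [sum(frame[c] for frame in utt) / n_frames for c in range(n_ch)]
    stds = [
        math.sqrt(sum((frame[c] - means[c]) ** 2 for frame in utt) / n_frames + eps)
        for c in range(n_ch)
    ]
    return means, stds


def spread(values, eps=1e-6):
    """Std of one statistic across the batch, used here as its uncertainty."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values) + eps)


def stochastic_stat_augment(batch, p=0.5, rng=None, training=True):
    """Hypothetical augmentation: resample per-channel feature statistics.

    batch: list of utterances; each utterance is a list of frames;
           each frame is a list of floats (one value per channel).
    p:     probability of applying the augmentation (assumed hyperparameter).
    """
    rng = rng or random.Random()
    # Leave features untouched at inference time or when the coin flip fails.
    if not training or rng.random() >= p:
        return batch

    stats = [channel_stats(utt) for utt in batch]
    n_ch = len(batch[0][0])
    # How much each channel's mean and std vary across the batch.
    mean_unc = [spread([s[0][c] for s in stats]) for c in range(n_ch)]
    std_unc = [spread([s[1][c] for s in stats]) for c in range(n_ch)]

    out = []
    for utt, (means, stds) in zip(batch, stats):
        # Treat each statistic as a Gaussian centred on its observed value.
        new_means = [means[c] + rng.gauss(0, 1) * mean_unc[c] for c in range(n_ch)]
        new_stds = [stds[c] + rng.gauss(0, 1) * std_unc[c] for c in range(n_ch)]
        # Normalize with the original statistics, then re-style with sampled ones.
        out.append([
            [(frame[c] - means[c]) / stds[c] * new_stds[c] + new_means[c]
             for c in range(n_ch)]
            for frame in utt
        ])
    return out
```

In a fine-tuning loop, a function like this would sit between backbone layers or between the SSL backbone and the classifier head, and would be disabled at evaluation time via `training=False`. A tensor-based version would be the practical choice; plain lists are used here only to keep the sketch dependency-free.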