From Self-Supervised Speech Models to Mixture-of-Experts for Robust Anti-Spoofing

2026-06-12 • Sound

Sound • Artificial Intelligence

AI summary

The authors aim to improve the detection of synthetic (spoofed) speech, a task that grows harder as generated speech becomes more natural. They convert a self-supervised speech representation model into a Mixture-of-Experts architecture by replacing the feed-forward blocks in selected encoder layers with multiple expert networks, each able to specialize in different acoustic patterns. The design is intended to generalize better to synthesis methods not seen during training. Across 14 spoofing datasets, the approach lowers the macro equal error rate from the baseline's 5.46% to 4.81%.

speech synthesis, spoofing detection, self-supervised learning, Mixture-of-Experts, encoder layers, gating mechanism, acoustic patterns, equal error rate, naturalness in speech, generalization

Authors

Hugo Daumain, Driss Matrouf, Khaled Khelif, Mickael Rouvier

Abstract

Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we transform a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization. Feed-forward blocks in selected encoder layers are replaced by multiple expert networks controlled by a layer-wise gating mechanism, allowing experts to capture complementary acoustic patterns while preserving the representations learned during self-supervised pretraining. We further analyze the architectural choices affecting the performance of this MoE conversion and investigate the activation behavior of the experts. The proposed approach is evaluated on 14 spoofing datasets and reduces the macro EER from 5.46% to 4.81%, corresponding to an 11.9% relative improvement over the baseline.
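The conversion described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes that every expert starts as a copy of the pretrained feed-forward block and that the gate mixes all experts with softmax weights (a dense mixture). The abstract does not state the initialization or the routing scheme, so both are illustrative choices, and the names `FeedForward` and `MoEFeedForward` are hypothetical.

```python
import copy
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """GELU activation (tanh approximation)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(value - m) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


class FeedForward:
    """Two-layer feed-forward block standing in for a pretrained encoder FFN."""

    def __init__(self, dim, hidden, rng):
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(dim)]
        self.b2 = [0.0] * dim

    def __call__(self, x):
        h = [gelu(v) for v in linear(x, self.w1, self.b1)]
        return linear(h, self.w2, self.b2)


class MoEFeedForward:
    """Replaces one FFN with several experts and a layer-wise softmax gate."""

    def __init__(self, pretrained_ffn, num_experts, dim, rng):
        # Assumption: each expert is initialised as a copy of the pretrained FFN.
        self.experts = [copy.deepcopy(pretrained_ffn) for _ in range(num_experts)]
        # The gate is the only newly initialised parameter set in this layer.
        self.gate_w = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_experts)]
        self.gate_b = [0.0] * num_experts

    def gate(self, x):
        """Return one mixing weight per expert; the weights sum to 1."""
        return softmax(linear(x, self.gate_w, self.gate_b))

    def __call__(self, x):
        weights = self.gate(x)
        outputs = [expert(x) for expert in self.experts]
        # Dense (soft) mixture: weighted sum of all expert outputs.
        return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(len(x))]


rng = random.Random(0)
dim, hidden = 4, 8
ffn = FeedForward(dim, hidden, rng)           # stand-in for a pretrained block
moe = MoEFeedForward(ffn, num_experts=3, dim=dim, rng=rng)

frame = [rng.gauss(0, 1) for _ in range(dim)]  # one toy feature frame
dense_out = ffn(frame)
moe_out = moe(frame)
max_diff = max(abs(a - b) for a, b in zip(dense_out, moe_out))
```

Under these assumptions the converted layer reproduces the original block's output before any fine-tuning, because identical experts mixed with weights that sum to 1 give the same result as a single expert. This is one way the pretrained representations could be preserved at the start of training. The experts can only diverge and specialize once fine-tuning updates them separately.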
