LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

2026-05-01

Sound; Computation and Language
AI summary

The authors found that existing speaker encoders have trouble recognizing the same person's voice consistently when the person speaks in different languages or scripts, especially with different accents. They created a new model called LASE that uses special training to focus on speaker identity while ignoring language differences. This approach made the encoding much more consistent across languages without needing a lot of training data. Their model performs well on tests and is released along with related data for others to use.

speaker encoder, voice cloning, cross-script recognition, accent variation, WavLM, ECAPA-TDNN, contrastive loss, gradient reversal layer, multilingual speech, speaker diarisation
Authors
Venkata Pushpak Teja Menta
Abstract
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of the language or script being spoken. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus spanning English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script, and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero), and it amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows that the GRL objective improves either backbone, but that the WavLM backbone choice also contributes. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
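The two-loss training objective can be sketched as follows. This is an illustrative numpy sketch, not the authors' implementation: the temperature, the adversarial weight `lam`, and the function names are assumptions, and the gradient-reversal layer is shown by its effect (the sign-flipped language-classifier gradient reaching the shared projection head) rather than as an autograd op.

```python
import numpy as np

def sup_con_loss(emb, voice_ids, temp=0.07):
    """Supervised contrastive loss over L2-normalised embeddings:
    each clip's positives are other clips of the same voice identity;
    all non-self pairs form the contrastive denominator."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / temp
    losses = []
    for i in range(len(voice_ids)):
        others = [j for j in range(len(voice_ids)) if j != i]
        pos = [j for j in others if voice_ids[j] == voice_ids[i]]
        if not pos:
            continue
        # log-sum-exp over all non-self similarities
        denom = np.logaddexp.reduce(sim[i, others])
        losses.append(-np.mean(sim[i, pos] - denom))
    return float(np.mean(losses))

def grl_encoder_grad(g_speaker, g_language, lam=1.0):
    """Effect of the gradient reversal layer: the update reaching the
    shared projection head is the speaker-loss gradient minus lam times
    the language-classifier gradient, so the head is trained to help
    speaker discrimination while *hurting* language classification."""
    return g_speaker - lam * g_language
```

In this framing, the 4-language classifier itself is trained normally on its cross-entropy; only the gradient flowing back through the reversal layer into the projection head has its sign flipped, which is what pushes the embedding toward language-uninformativeness.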