Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction

2026-04-03

Sound
AI summary

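The authors developed a way to pick out individual voices from noisy audio without needing a clean example of each speaker’s voice beforehand. Their method predicts a small set of voice-identifying features (speaker embeddings) directly from the mixed audio, and is trained to match a reference single-speaker embedding model regardless of the order in which the speakers are predicted. These features group cleanly by speaker and, on standard clustering measures, outperform features obtained from WavLM with K-means and from separation-based methods. When used to guide voice extraction, they improve objective quality and intelligibility, and they also work on real-world noisy recordings.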

target speech extraction, speaker embedding, permutation-invariant training, LibriMix, WavLM, clustering, noise robustness, DNS Challenge, voice separation, objective speech quality
Authors
FNU Sidharth, Meysam Asgari, Hao-Wen Dong, Dhruv Jain
Abstract
Personalized or target speech extraction (TSE) typically needs a clean enrollment, which is hard to obtain in real-world crowded environments. We remove the need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings, which are trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher supervision. On noisy LibriMix, the resulting embeddings form a structured and clusterable identity space, outperforming WavLM+K-means and separation-derived embeddings on standard clustering metrics. Conditioning multiple extraction back-ends on these embeddings consistently improves objective quality and intelligibility, and the approach generalizes to real DNS-Challenge recordings.
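
The permutation-invariant teacher supervision mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss function, the number of embedding slots, or how the best assignment is found, so everything here is an assumption made for illustration: the function names `cosine_distance` and `pit_embedding_loss` are hypothetical, cosine distance stands in for whatever alignment loss the authors use, and the search is a brute-force pass over all permutations, which is only practical for a small number of speakers.

```python
import itertools
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pit_embedding_loss(predicted, teacher):
    """Permutation-invariant loss between two equally sized embedding sets.

    Tries every assignment of predicted slots to teacher embeddings and
    returns (lowest mean cosine distance, best permutation), where
    best_perm[i] is the teacher index matched to predicted slot i.
    Cosine distance is an assumed choice, not taken from the paper.
    """
    assert len(predicted) == len(teacher)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(len(teacher))):
        total = sum(cosine_distance(p, teacher[j]) for p, j in zip(predicted, perm))
        loss = total / len(perm)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the predicted slots come out in the opposite order
# from the teacher embeddings, so the best assignment swaps them.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
predicted = [[0.1, 0.9, 0.0], [0.8, 0.0, 0.2]]
loss, perm = pit_embedding_loss(predicted, teacher)
```

In this toy case the minimum is reached by the swapped assignment, which is the point of the permutation-invariant formulation: the model is not penalized for the arbitrary order in which it emits speakers. With more speakers, an assignment solver such as the Hungarian algorithm would typically replace the brute-force loop.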