Saar-Voice: A Multi-Speaker Saarbrücken Dialect Speech Corpus

2026-04-13
Computation and Language

AI summary

The authors created Saar-Voice, a six-hour collection of speech recordings for the Saarbrücken German dialect, which is usually overlooked in language technology. They gathered text from digitized books and local sources, then recorded nine speakers reading parts of this text. The authors analyzed the dataset and discussed challenges like spelling differences and variations in how people speak. Their work aims to help future systems that convert text to speech, especially for dialects with limited data.

natural language processing, speech corpus, dialects, Saarbrücken dialect, text-to-speech, grapheme-to-phoneme conversion, low-resource languages, orthographic variation, speaker variation, zero-shot learning
Authors
Lena S. Oberkircher, Jesujoba O. Alabi, Dietrich Klakow, Jürgen Trouvain
Abstract
Natural language processing (NLP) and speech technologies have made significant progress in recent years; however, they remain largely focused on standardized language varieties. Dialects, despite their cultural significance and widespread use, are underrepresented in linguistic resources and computational models, resulting in performance disparities. To address this gap, we introduce Saar-Voice, a six-hour speech corpus for the Saarbrücken dialect of German. The dataset was created by first collecting text from digitized books and locally sourced materials. A subset of this text was recorded by nine speakers, and we conducted analyses on both the textual and speech components to assess the dataset's characteristics and quality. We discuss methodological challenges related to orthographic and speaker variation, and explore grapheme-to-phoneme (G2P) conversion. The resulting corpus provides aligned textual and audio representations and serves as a foundation for future research on dialect-aware text-to-speech (TTS), particularly in low-resource scenarios, including zero-shot and few-shot model adaptation.