Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning
2026-03-06 • Computation and Language
AI summary
The authors developed a new speech recognition system that can understand multiple languages and different accents by using context from previous conversations. They combined parts of existing speech and language models in a way that keeps them separate but works well together. Their method uses a special training step to make speech and context features match up better. Tests on over 1,500 hours of real conversations across 11 languages and 5 English dialects showed their approach improves transcription accuracy by over 5%. This work shows that using context and linking speech with language information helps make multilingual speech recognition better.
automatic speech recognition, multilingual ASR, context-aware ASR, pretrained models, contrastive learning, speech encoder, language model, cross-modal alignment, embedding space, dialogue history
Authors
Yuchen Zhang, Haralambos Mouratidis, Ravi Shekhar
Abstract
Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns their representations in a shared embedding space. Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality. Contrastive alignment provides additional gains when applied to different context types, with an overall performance gain of over 5%. These results highlight the importance of both contextual modeling and cross-modal alignment in multilingual ASR.
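The contrastive alignment objective described above can be illustrated with a minimal sketch. The snippet below implements a symmetric InfoNCE-style loss in pure Python, where each matched (speech, context) embedding pair in a batch is a positive and all other pairings act as negatives; the function names and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(speech, context, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    speech[i] and context[i] form a positive pair; every other
    combination in the batch serves as an in-batch negative.
    """
    n = len(speech)
    # Temperature-scaled similarity matrix: rows = speech, cols = context.
    sims = [[cosine(s, c) / temperature for c in context] for s in speech]
    loss = 0.0
    for i in range(n):
        # speech -> context direction: classify the matching context.
        row = sims[i]
        loss += -row[i] + math.log(sum(math.exp(x) for x in row))
        # context -> speech direction: classify the matching speech.
        col = [sims[j][i] for j in range(n)]
        loss += -col[i] + math.log(sum(math.exp(x) for x in col))
    return loss / (2 * n)

# Toy check: aligned pairs should score a lower loss than mismatched ones.
speech = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
low, high = info_nce(speech, aligned), info_nce(speech, shuffled)
```

Minimizing this objective pulls each utterance's speech embedding toward its own context prompt (dialogue history or biasing words) in the shared space while pushing it away from unrelated contexts, which is the intuition behind the cross-modal alignment the paper evaluates.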