"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most

2026-02-12

Artificial Intelligence, Computation and Language, Computers and Society
AI summary

The authors found that speech recognition systems often make mistakes when transcribing U.S. street names, especially when English is not the speaker's first language. They tested 15 popular models and found that almost half of the transcriptions were wrong. These errors cause bigger downstream problems for people who primarily speak languages other than English. To address this, the authors generated synthetic speech samples with varied pronunciations and used them to fine-tune the models. After this training, the models became much better at transcribing street names for non-English primary speakers.

speech recognition, transcription error rate, named entities, text-to-speech, fine-tuning, synthetic data, linguistic diversity, non-English speakers, speech system reliability
Authors
Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
Abstract
Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions on geographic routing and show that mis-transcriptions systematically cause errors for all speakers, but that routing-distance errors are twice as large for non-English primary speakers as for English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with fewer than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
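
To make the pipeline concrete, below is a minimal sketch of the two ingredients the abstract describes: rendering street names with a text-to-speech model to build synthetic (audio, text) fine-tuning pairs, and scoring entity-level transcription errors. This is not the authors' implementation: the street names, voice handles, and `synthesize_to_wav` wrapper are illustrative placeholders (the placeholder writes silence and should be swapped for a real open-source TTS backend), and the error metric shown is one plausible scoring rule, not necessarily the one used in the paper.

```python
import csv
import wave
from pathlib import Path

# Illustrative inputs; the paper's actual street names, voices, and sample
# counts are not specified here.
STREET_NAMES = ["Guadalupe Street", "Ximeno Avenue", "Dubuque Street"]
VOICE_IDS = ["voice_01", "voice_02", "voice_03", "voice_04"]
OUT_DIR = Path("synthetic_street_names")


def synthesize_to_wav(text: str, voice_id: str, out_path: Path) -> None:
    """Placeholder TTS wrapper: writes one second of silence.

    Swap this body for a call to an open-source text-to-speech model that
    can vary speaker identity, so each street name is rendered with
    several different pronunciations.
    """
    with wave.open(str(out_path), "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 16-bit samples
        wf.setframerate(16000)  # 16 kHz
        wf.writeframes(b"\x00\x00" * 16000)


def build_synthetic_dataset() -> Path:
    """Render every street name with every voice and record (audio, text) pairs."""
    OUT_DIR.mkdir(exist_ok=True)
    manifest = OUT_DIR / "manifest.csv"
    with manifest.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["audio_path", "reference_text"])
        for name in STREET_NAMES:
            for voice_id in VOICE_IDS:
                wav_path = OUT_DIR / f"{name.replace(' ', '_')}_{voice_id}.wav"
                synthesize_to_wav(name, voice_id, wav_path)
                writer.writerow([str(wav_path), name])
    return manifest


def street_name_error_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (reference street name, ASR transcript) pairs where the
    normalized street name is absent from the transcript."""
    def norm(s: str) -> str:
        return " ".join(s.lower().replace(",", " ").split())

    errors = sum(norm(ref) not in norm(hyp) for ref, hyp in pairs)
    return errors / len(pairs)


if __name__ == "__main__":
    print("manifest written to", build_synthetic_dataset())
    demo = [("Guadalupe Street", "go to guadalupe street"),
            ("Ximeno Avenue", "see me no avenue")]
    print(f"error rate on demo pairs: {street_name_error_rate(demo):.0%}")
```

The manifest of synthetic (audio, text) pairs is the kind of small dataset the abstract refers to: fine-tuning on fewer than 1,000 such samples is what yields the reported accuracy gains for non-English primary speakers.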