Real-Time Voice AI Hears but Does Not Listen

2026-06-24 • Computation and Language

AI summary

The authors tested four popular voice AI systems to see if they understand not just the words people say but also the emotions behind how they say them. They found that these systems mostly focus on the words and often ignore important emotional cues like crying or sarcasm when making decisions. Even when the systems recognize emotions, they tend to act as if those feelings don't matter. This means the systems might not be reliable in situations where how something is said is just as important as what is said. Attempts to make the systems pay more attention to tone only helped a little.

voice AI, emotional intelligence, speech recognition, vocal delivery, sarcasm detection, real-time systems, accent estimation, tone of voice, OpenAI GPT, Google Gemini

Authors

Martijn Bartelds, Federico Bianchi, James Zou

Abstract

Speech conveys information through both words and vocal delivery. We evaluate four leading production realtime voice systems (OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash) on tasks where the words and the delivery patterns both convey meaningful information. Across three consequential scenarios, all four systems act on the words rather than the voice. They end calls with crying callers who insist nothing is wrong, approve wire transfers authorized in frightened voices, and enroll callers whose agreement is clearly sarcastic. Surprisingly, this is often not a failure of perception. When asked directly, three of the four systems reliably identify the distress, fear, or sarcasm they later ignore when making decisions. We observe a similar pattern when these realtime voice systems estimate accent and age, as their responses frequently follow the biases of the words rather than the acoustic properties of the speaker. We term this disconnect between perception and action the emotional intelligence gap of voice AI. Prompting systems to explicitly attend to vocal delivery improves performance only partially and inconsistently. Our findings show that current realtime voice AI systems often behave as if speech had been reduced to a transcript, suggesting that they should be used with caution in settings where the tone and emotion of delivery convey important information.
