Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
2026-06-03 • Computation and Language
AI summary
The authors studied why Speech Large Language Models (SLLMs) lag behind text-based models on complex reasoning. Across three SLLMs, speech input matched or exceeded text input on spatial, syntactic, and factual tasks, but accuracy fell to chance on logical tasks that require tracking entities. The authors attribute this to an entity binding failure: with continuous speech features, models lose the precise association between entities and their properties during implicit reasoning. To address this, they propose Entity-Aware Chain-of-Thought (EA-CoT), which makes the model explicitly enumerate entities and bind them to claims before reasoning. EA-CoT improved absolute accuracy by up to 24.4%, even when spoken names were misrecognized.
Speech Large Language Models, Text-to-Text Models, Entity Binding, Chain-of-Thought, Spatial Reasoning, Logical Reasoning, Entity Tracking, Speech Recognition, Semantic Binding
Authors
Ming-Hao Hsu, Xiaohai Tian, Jun Zhang, Zhizheng Wu
Abstract
Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.
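The enumerate-then-bind-then-reason structure described for EA-CoT can be illustrated with a minimal prompt-building sketch. The abstract does not give the authors' actual prompt, so the wording of each step, the function names `build_ea_cot_prompt` and `extract_answer`, and the `Answer:` output convention are all assumptions made for illustration only.

```python
from typing import Optional


def build_ea_cot_prompt(question: str) -> str:
    """Build a hypothetical entity-aware chain-of-thought instruction.

    The step wording is an assumption; it only mirrors the order described
    in the abstract: enumerate entities, bind them to claims, then reason.
    """
    steps = [
        "Step 1 (enumerate): List every entity mentioned in the audio, one per line.",
        "Step 2 (bind): For each entity, write the claims or properties stated about it.",
        "Step 3 (reason): Using only the bindings from Step 2, reason step by step.",
        "Step 4 (answer): Give the final answer on a line starting with 'Answer:'.",
    ]
    return "\n".join(steps + ["", f"Question: {question}"])


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if absent."""
    for line in reversed(response.splitlines()):
        stripped = line.strip()
        if stripped.startswith("Answer:"):
            return stripped[len("Answer:"):].strip()
    return None
```

In this sketch the text prompt would accompany the spoken input to the SLLM. Because the model writes out its own entity list before reasoning, a misrecognized name can still be used consistently in the later steps, which is one plausible reading of why the abstract reports gains even under name misrecognition.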