Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI
2026-04-09 • Sound, Artificial Intelligence
AI summary
The authors study how to detect whether speech is addressed to a device in multi-speaker conversations, especially on devices with limited computing power that must decide quickly whether to forward audio for further processing. They propose treating this as a sequence problem that uses recent interaction history, rather than classifying each utterance in isolation. Their method, called the Selective Attention System (SAS), runs fully on-device and reaches F1=0.86 with audio alone, rising to F1=0.95 when combined with video. Removing recent interaction history caused the largest performance drop among the components they tested.
device-addressed speech detection, sequential routing, utterance-local classification, interaction history, on-device processing, audio-video fusion, low-latency systems, ARM Cortex-A, F1 score, speech recognition preprocessing
Authors
David Joohun Kim, Daniyal Anjum, Bonny Banerjee, Omar Abbasi
Abstract
We study device-addressed speech detection under pre-ASR edge deployment constraints, where systems must decide whether to forward audio before transcription under strict latency and compute limits. We show that, in multi-speaker environments with temporally ambiguous utterances, this task is more effectively modelled as a sequential routing problem over interaction history than as an utterance-local classification task. We formalize this as Sequential Device-Addressed Routing (SDAR) and present the Selective Attention System (SAS), an on-device implementation that instantiates this formulation. On a held-out 60-hour multi-speaker English test set, the primary audio-only configuration achieves F1=0.86 (precision=0.89, recall=0.83); with an optional camera, audio+video fusion raises F1 to 0.95 (precision=0.97, recall=0.93). Removing causal interaction history (Stage 3) reduced F1 from 0.95 to 0.57 ± 0.03 in the audio+video configuration under our evaluation protocol. Among the tested components, this was the largest observed ablation effect, indicating that short-horizon interaction history carries substantial decision-relevant information in the evaluated setting. SAS runs fully on-device on ARM Cortex-A class hardware (<150 ms latency, <20 MB footprint). All results come from internal evaluation on a proprietary, primarily English dataset; a 5-hour evaluation subset may be shared for independent verification (Section 8.8).
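As a quick consistency check on the reported metrics, the sketch below recomputes F1 from the precision and recall values quoted in the abstract, using the standard harmonic-mean definition. The function name `f1_score` and the `reported` dictionary are illustrative choices made here, not part of SAS, and the check says nothing about how the authors computed their scores.

```python
def f1_score(precision: float, recall: float) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as quoted in the abstract.
# The configuration labels are illustrative names only.
reported = {
    "audio-only": (0.89, 0.83, 0.86),
    "audio+video": (0.97, 0.93, 0.95),
}

for name, (p, r, f1_reported) in reported.items():
    f1 = f1_score(p, r)
    print(f"{name}: computed F1 = {f1:.3f}, reported F1 = {f1_reported:.2f}")
```

Both recomputed values agree with the reported F1 scores after rounding to two decimals, so the quoted precision, recall, and F1 figures are mutually consistent.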