Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text
2026-02-27 • Machine Learning
AI summary
The authors introduce Chunk-wise Attention Transducer (CHAT), an improved version of RNN-T models that processes audio in small chunks and uses attention within each chunk. This makes the model more efficient by reducing memory use and speeding up both training and inference. CHAT also improves accuracy in tasks like speech recognition and translation, especially where strict timing rules of RNN-T limit performance. Overall, the authors show CHAT is a practical way to build faster and better streaming speech systems without losing real-time operation.
RNN-T, streaming speech recognition, chunk-wise processing, cross-attention, temporal dimension, word error rate (WER), speech translation, BLEU score, machine learning efficiency
Authors
Hainan Xu, Vladimir Bataev, Travis M. Bartley, Jagadeesh Balam
Abstract
We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. This hybrid approach maintains RNN-T's streaming capability while introducing controlled flexibility for local alignment modeling. CHAT significantly reduces the temporal dimension that RNN-T must handle, yielding substantial efficiency improvements: up to 46.2% reduction in peak training memory, up to 1.36× faster training, and up to 1.69× faster inference. Alongside these efficiency gains, CHAT achieves consistent accuracy improvements over RNN-T across multiple languages and tasks -- up to 6.3% relative WER reduction for speech recognition and up to 18.0% BLEU improvement for speech translation. The method proves particularly effective for speech translation, where RNN-T's strict monotonic alignment hurts performance. Our results demonstrate that the CHAT model offers a practical solution for deploying more capable streaming speech models without sacrificing real-time constraints.
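The abstract's core efficiency claim is that pooling encoder frames chunk-by-chunk shrinks the temporal dimension the transducer must cover. As a minimal sketch of that idea (not the authors' implementation — the chunking scheme, the single learned query vector `query`, and all shapes here are assumptions for illustration), one can attend over each fixed-size chunk of encoder frames and emit one summary vector per chunk:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def chunkwise_attention(frames, query, chunk_size):
    """Pool encoder frames chunk-by-chunk with cross-attention.

    frames: (T, d) encoder outputs; query: (d,) hypothetical learned
    per-chunk query. Returns (T // chunk_size, d) chunk summaries,
    reducing the temporal dimension from T to T // chunk_size.
    """
    T, d = frames.shape
    assert T % chunk_size == 0  # assume input is padded to a multiple
    chunks = frames.reshape(T // chunk_size, chunk_size, d)
    scores = chunks @ query / np.sqrt(d)      # (num_chunks, chunk_size)
    weights = softmax(scores, axis=-1)        # attention within each chunk
    return np.einsum("nc,ncd->nd", weights, chunks)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 4))          # T=8 frames, d=4 features
query = rng.standard_normal(4)
out = chunkwise_attention(frames, query, chunk_size=4)
print(out.shape)  # (2, 4): 8 frames collapsed to 2 chunk summaries
```

With `chunk_size=4`, the transducer's alignment lattice over 8 frames would span only 2 positions, which is the kind of temporal reduction the reported memory and speed gains rest on.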