VoiceAgentRAG: Solving the RAG Latency Bottleneck in Real-Time Voice Agents Using Dual-Agent Architectures
2026-03-02 • Sound
AI summary
The authors created VoiceAgentRAG, a system with two parts that work together to answer questions faster. One part, the Slow Thinker, thinks ahead and retrieves useful information before it is needed. The other part, the Fast Talker, quickly uses this pre-fetched information to respond without searching again. By serving responses from a special memory cache, this setup makes conversational replies much quicker.
dual-agent system, retrieval-augmented generation, language model (LLM), semantic cache, FAISS, vector database, response generation, document retrieval, cache hits, prediction
Authors
Jielin Qiu, Jianguo Zhang, Zixiang Chen, Liangwei Yang, Ming Zhu, Juntao Tan, Haolin Chen, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Abstract
We present VoiceAgentRAG, an open-source dual-agent memory router that decouples retrieval from response generation. A background Slow Thinker agent continuously monitors the conversation stream, predicts likely follow-up topics using an LLM, and pre-fetches relevant document chunks into a FAISS-backed semantic cache. A foreground Fast Talker agent reads only from this sub-millisecond cache, bypassing the vector database entirely on cache hits.
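The cache-hit path described above can be sketched in a few lines. This is a minimal, illustrative model only: it substitutes a plain NumPy cosine-similarity scan for the FAISS index, and the class, method names, and the 0.85 similarity threshold are assumptions, not the paper's implementation.

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: the Slow Thinker writes pre-fetched chunks,
    the Fast Talker reads them without touching the vector database."""

    def __init__(self, dim: int):
        self.keys = np.empty((0, dim), dtype=np.float32)  # predicted-query embeddings
        self.values: list[str] = []                       # pre-fetched document chunks

    def prefetch(self, embedding: np.ndarray, chunk: str) -> None:
        # Slow Thinker side: store a predicted follow-up topic embedding
        # (unit-normalized so dot product equals cosine similarity).
        v = (embedding / np.linalg.norm(embedding)).astype(np.float32)
        self.keys = np.vstack([self.keys, v])
        self.values.append(chunk)

    def lookup(self, embedding: np.ndarray, threshold: float = 0.85):
        # Fast Talker side: return the best-matching chunk on a cache hit,
        # or None on a miss (caller falls back to the vector database).
        if not self.values:
            return None
        q = (embedding / np.linalg.norm(embedding)).astype(np.float32)
        sims = self.keys @ q
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= threshold else None
```

In this sketch the miss path (falling back to full retrieval) is left to the caller, which mirrors the abstract's claim that the Fast Talker bypasses the vector database only on cache hits.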