$τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge

2026-03-04

Artificial Intelligence · Computation and Language · Information Retrieval
AI summary

The authors introduce a benchmark called τ-Knowledge to test how well conversational agents can combine large bodies of complex, real-world knowledge with tool use, especially over long conversations. It centers on a banking support scenario in which the agent must navigate many interlinked documents and make correct account updates. Even the best current models succeed only about 25% of the time, and their reliability drops sharply over repeated trials. The authors argue this benchmark is a useful testbed for improving agents that must understand and act on complex, unstructured knowledge during live interactions.

Keywords
conversational agents, knowledge retrieval, tool use, τ-Bench, fintech, customer support workflows, embedding-based retrieval, policy compliance, unstructured data, long-horizon interactions
Authors
Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, Victor Barres
Abstract
Conversational agents are increasingly deployed in knowledge-intensive settings, where correct behavior depends on retrieving and applying domain-specific knowledge from large, proprietary, and unstructured corpora during live interactions with users. Yet most existing benchmarks evaluate retrieval or tool use independently of each other, creating a gap in realistic, fully agentic evaluation over unstructured data in long-horizon interactions. We introduce $τ$-Knowledge, an extension of $τ$-Bench for evaluating agents in environments where success depends on coordinating external, natural-language knowledge with tool outputs to produce verifiable, policy-compliant state changes. Our new domain, $τ$-Banking, models realistic fintech customer support workflows in which agents must navigate roughly 700 interconnected knowledge documents while executing tool-mediated account updates. Across embedding-based retrieval and terminal-based search, even frontier models with high reasoning budgets achieve only $\sim$25.5% pass^1, with reliability degrading sharply over repeated trials. Agents struggle to retrieve the correct documents from densely interlinked knowledge bases and to reason accurately over complex internal policies. Overall, $τ$-Knowledge provides a realistic testbed for developing agents that integrate unstructured knowledge in human-facing deployments.
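The abstract reports pass^1 and notes that reliability degrades over repeated trials; this refers to the pass^k family of metrics popularized by τ-Bench, which measures the probability that *all* k independent trials of a task succeed. A minimal sketch of the standard unbiased estimator, (c choose k)/(n choose k) averaged over tasks, is below; the function name and the example success counts are illustrative, not taken from the paper.

```python
from math import comb

def pass_hat_k(num_trials: int, successes_per_task: list[int], k: int) -> float:
    """Estimate pass^k: the probability that k i.i.d. trials of a task
    ALL succeed, averaged over tasks.

    Uses the unbiased estimator C(c, k) / C(n, k), where c is the number
    of successful trials out of n for a given task (math.comb returns 0
    when c < k, so tasks with too few successes contribute nothing).
    """
    return sum(
        comb(c, k) / comb(num_trials, k) for c in successes_per_task
    ) / len(successes_per_task)

# Hypothetical run: 4 tasks, 8 trials each; per-task success counts.
successes = [8, 4, 2, 0]
print(pass_hat_k(8, successes, 1))  # 0.4375  (plain average success rate)
print(pass_hat_k(8, successes, 4))  # much lower: all 4 trials must pass
```

Because pass^k requires every trial to succeed, it falls quickly in k unless a task is solved near-deterministically, which is exactly the degradation over repeated trials the abstract describes.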