QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples

2026-06-03 | Computation and Language

Computation and Language, Artificial Intelligence, Information Retrieval
AI summary

The authors introduce QO-Bench, a diagnostic benchmark that checks how well systems answer questions requiring specific query operations (such as joins and counts) over structured records latent in text. They find that current methods often retrieve relevant text but lose the typed values needed for accurate answers. Their tests show that which system works best depends on the operation required, and that improving retrieval alone is not enough: even when given the gold evidence, executing the operator correctly remains a core bottleneck. QO-Bench shifts the focus from finding related passages to preserving the data needed for exact query operations.

retrieval-augmented generation, query-operator, typed event tuples, information extraction, database queries, joins, intersection, recall, long-context oracle, semantic relevance
Authors
Mengao Zhang, Xiang Yang, Chang Liu, Tianhui Tan, Ke-wei Huang
Abstract
Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records latent in text. Existing retrieval-augmented generation (RAG) systems are optimized primarily for semantic relevance, but retrieving plausible passages does not guarantee correct query execution. We introduce QO-Bench, a diagnostic benchmark for query-operator question answering over typed event tuples. The benchmark covers 22,984 news articles and 614 corporate events across 18 query templates, evaluated on 785 questions. Each gold answer is deterministically computed from typed event tuples and scored by recall, with answers matched to the gold tuples by exact match rather than an LLM judge. This design enables diagnosis at the level of individual operators, such as joins and intersection. We evaluate RAG, ReAct RAG, GraphRAG, and information-extraction-to-SQL under matched conditions, with a long-context oracle ceiling to isolate retrieval failure. A two-axis framework (index-time preservation versus query-time execution) predicts where each paradigm fails, and the results bear it out: systems retrieve relevant text but discard the typed values operators need, and the deployable paradigm ranking inverts across operators, with similarity retrieval leading on filter/project and extraction-to-SQL leading on intersection and counting. Even given the gold evidence, a long-context oracle stays far from saturated, so operator execution, not retrieval alone, is a core bottleneck that a stronger answer model does not remove. QO-Bench reframes the goal from passage relevance to query-operator-preserving retrieval.
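To make the evaluation design concrete, here is a minimal Python sketch of how a gold answer could be computed deterministically from typed event tuples and then scored by exact-match recall. It is an illustration under assumptions, not the benchmark's actual schema or scorer: the `EventTuple` fields, the sample events, and the helper names (`filter_project`, `intersection`, `count`, `exact_match_recall`) are hypothetical, as is the convention that an empty gold set scores 1.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventTuple:
    """Hypothetical typed event record; the real schema is not given in the abstract."""
    company: str
    event_type: str
    year: int


# Invented toy data, used only to exercise the operators below.
EVENTS = [
    EventTuple("Acme", "acquisition", 2021),
    EventTuple("Acme", "layoff", 2022),
    EventTuple("Globex", "acquisition", 2021),
    EventTuple("Initech", "layoff", 2022),
]


def filter_project(events, event_type):
    """Filter tuples by event type, then project onto the company field."""
    return {e.company for e in events if e.event_type == event_type}


def intersection(events, type_a, type_b):
    """Companies that appear under both event types."""
    return filter_project(events, type_a) & filter_project(events, type_b)


def count(events, event_type):
    """Number of distinct companies with the given event type."""
    return len(filter_project(events, event_type))


def exact_match_recall(predicted, gold):
    """Fraction of gold items found verbatim in the prediction (no LLM judge)."""
    gold = set(gold)
    if not gold:
        return 1.0  # assumed convention: nothing to recall
    return len(gold & set(predicted)) / len(gold)


gold = intersection(EVENTS, "acquisition", "layoff")
print(gold)                                          # gold answer set
print(exact_match_recall({"Acme", "Globex"}, gold))  # extra items do not lower recall
print(exact_match_recall({"acme"}, gold))            # a case mismatch fails exact match
```

Because the gold set comes from executing the operator over the tuples rather than from a judged free-text answer, a system that retrieves relevant passages but drops a typed value (for example, the event type) cannot reproduce the intersection and loses recall on exactly that operator.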