Natural Language Query to Configuration for Retrieval Agents
2026-05-26 • Artificial Intelligence
AI summary
The authors study how to improve retrieval systems whose many settings affect both answer quality and cost. They introduce BRANE, a method that uses a language model to extract characteristics of each query and then predicts, for each configuration, whether it will answer that query correctly. Instead of fixing one set of settings for all queries, BRANE selects a configuration per query by trading predicted correctness against cost. Experiments on MuSiQue, BrowseComp-Plus, and FinanceBench show that BRANE outperforms the baselines and matches the best fixed configuration's accuracy at up to 89% lower cost.
retrieval agents, large language models (LLM), pipeline configuration, inference time, cost-quality tradeoff, per-query optimization, predictive modeling, MuSiQue dataset, FinanceBench dataset, Pareto frontier
Authors
Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka, Esha Choukse, Matei Zaharia
Abstract
Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
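The selection rule described in the abstract (maximize predicted correctness penalized by cost) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Config` type, the `select_config` function, the penalty weight `lam`, the linear form of the penalty, the normalized costs, and the example probabilities are all assumptions made for illustration. In practice the probabilities would come from the per-configuration predictors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """Hypothetical catalog entry: a pipeline name and its per-query cost."""
    name: str
    cost: float  # assumption: cost normalized to [0, 1] across the catalog


def select_config(p_correct, configs, lam):
    """Return the config with the highest cost-penalized predicted correctness.

    p_correct maps config name -> predicted probability of a correct answer.
    lam is an assumed tradeoff weight: 0 ignores cost, larger values favor
    cheaper pipelines. Changing lam requires no retraining of the predictors.
    """
    return max(configs, key=lambda c: p_correct[c.name] - lam * c.cost)


# Illustrative catalog and made-up predictor outputs for a single query.
configs = [
    Config("small-1hop", 0.1),
    Config("mid-2hop", 0.4),
    Config("large-4hop", 1.0),
]
p_correct = {"small-1hop": 0.55, "mid-2hop": 0.80, "large-4hop": 0.90}

accurate = select_config(p_correct, configs, lam=0.0)  # cost ignored
balanced = select_config(p_correct, configs, lam=0.3)  # moderate penalty
cheap = select_config(p_correct, configs, lam=1.0)     # heavy penalty

print(accurate.name, balanced.name, cheap.name)
```

Sweeping `lam` over a range of values for each query traces out different operating points, which is one way a tunable cost-quality tradeoff of the kind the abstract describes could be exposed at inference time.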