Skill Is Not Document: A Query-Conditional Benchmark and Two-Stage Retriever for LLM Agent Skill Routing
2026-06-02 • Information Retrieval
AI summary
The authors study how LLM agents select multiple skills to complete a task. They argue that ranking skills by individual relevance is not enough, because the skills retrieved together must also be able to cooperate under the given query. As a training signal for this compatibility, they use the LLM's own rejection decisions during data synthesis, which are usually discarded as low-quality data. To test the idea, they built R3-Skill, a bilingual (Chinese-English) benchmark, and a two-stage retrieval system (R3-Embedding + R3-Reranker) that learns skill compatibility explicitly. The pipeline reaches Hit@1 = 0.7714 and NDCG@10 = 0.8327 on R3-Skill, and the dataset, training code and model weights are released as open source.
LLM agents, skill retrieval, semantic relevance, skill compatibility, data synthesis, cross-encoder, bi-encoder, benchmark dataset, Hit@1, NDCG
Authors
Zifei Wang, Wei Wen, Qiang Ji, Ruizhi Qiao, Xing Sun
Abstract
LLM agents complete complex tasks by composing multiple skills, and skill retrieval is a front-end stage for agents. Skill retrieval differs fundamentally from traditional document retrieval at the supervision level: top-K joint correctness depends not only on the semantic relevance of each individual query-skill pair, but also on whether the skills retrieved together can collaborate to fulfill the task under the given query. Such "skill compatibility" cannot be derived from independent relevance alone. Yet existing LLM-based data synthesis pipelines can produce a direct supervision signal for "which skills should not be jointly retrieved under this query" -- namely the LLM's own rejection decisions -- and this signal is routinely discarded as low-quality data. To address this gap, we propose Reject-as-Resource Retriever (R3) and construct R3-Skill, a bilingual (Chinese-English) skill retrieval benchmark targeting realistic agent skill routing. R3-Skill spans four language directions, features query phrasings close to real user requests, and is verified through multi-expert cross-checking. On R3-Skill, we build a two-stage retrieval system (R3-Embedding + R3-Reranker) with skill compatibility as an explicit training signal. Gradient analysis shows that the "push-away" signal is diluted by bilateral balancing in the bi-encoder but acts as lossless graded ranking supervision in the cross-encoder -- motivating its placement at the cross-encoder stage, as confirmed by ablations on two datasets. The R3-Embedding + R3-Reranker pipeline attains Hit@1 = 0.7714, NDCG@10 = 0.8327 and Set-Compat = 0.3525 on R3-Skill. The dataset, training code and model weights are released as open source for agent skill routing.
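For readers unfamiliar with the two standard ranking metrics reported above, the following is a minimal sketch of Hit@1 and NDCG@10 under binary relevance. It is an illustration only: the abstract does not give the paper's exact metric definitions, so the binary-relevance assumption, the function names and the skill names are all hypothetical. Set-Compat is specific to the paper and is not reproduced here.

```python
import math
from typing import Sequence, Set


def hit_at_1(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1.0 if the top-ranked skill is relevant, else 0.0."""
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """NDCG@k with binary gains (assumption: each relevant skill has gain 1)."""
    # Discounted gain of relevant skills found in the top k positions.
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, skill in enumerate(ranked[:k])
        if skill in relevant
    )
    # Ideal ordering places all relevant skills first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query whose gold skill set contains two skills.
ranked_skills = ["pdf_reader", "web_search", "translator"]
gold_skills = {"pdf_reader", "translator"}

hit = hit_at_1(ranked_skills, gold_skills)
ndcg = ndcg_at_k(ranked_skills, gold_skills, k=10)
print(f"Hit@1 = {hit:.4f}, NDCG@10 = {ndcg:.4f}")
```

Both metrics score each retrieved skill independently against the gold set, so neither one checks whether the skills retrieved together can cooperate, which is the gap the abstract attributes to independent relevance.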