DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
2026-06-15 • Computation and Language
AI summary
The authors developed DeepRubric, a method for training AI agents that write detailed research reports by aligning what the agent is asked to do with how its output is scored. Instead of guessing the evaluation criteria from the question alone, DeepRubric first fixes the exact points a report should cover by building an evidence tree, then derives both the training question and its rubric from that tree. Because the rubric checks exactly what the question requests, the reward signal is more reliable and training is more efficient. The resulting 8B model performs comparably to prior open state-of-the-art deep research models on three benchmarks while using roughly 13x fewer RL GPU-hours.
Deep research agents, Reinforcement learning, Rubric-based rewards, Large Language Models (LLMs), Evidence tree, Query-rubric pairs, GRPO (Group Relative Policy Optimization), Atomic evaluation, Recursive decomposition, Report quality assessment
Authors
Minghang Zhu, Chuyang Wei, Junhao Xu, Yilin Cheng, Zhumin Chen, Jiyan He
Abstract
Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query; when the model fails to infer the underlying information needs, the generated rubrics may be incomplete, which reduces RL efficiency. To obtain more reliable query-rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query-rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query-rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving performance comparable to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.
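The pipeline described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. The class `EvidenceNode`, the helpers `leaves`, `rubric_reward`, and `group_relative_advantages`, the keyword-matching judge, and the toy tree are all hypothetical names and data introduced here for illustration. The sketch assumes that each leaf of the evidence tree becomes one rubric item, that the reward is the fraction of rubric items a judge marks as satisfied, and that advantages are obtained by standardizing rewards within a group of rollouts for the same query, which is the group-relative idea behind GRPO. The abstract does not specify the actual aggregation or judge.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class EvidenceNode:
    """A sub-question in a hypothetical evidence tree."""
    question: str
    evidence: str = ""  # supporting snippet; left empty on internal nodes
    children: List["EvidenceNode"] = field(default_factory=list)


def leaves(node: EvidenceNode) -> List[EvidenceNode]:
    """Collect leaves depth-first; each leaf is one atomic evaluation target."""
    if not node.children:
        return [node]
    found: List[EvidenceNode] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def rubric_reward(report: str, rubrics: List[EvidenceNode],
                  judge: Callable[[str, EvidenceNode], bool]) -> float:
    """Fraction of rubric items satisfied (an assumed aggregation rule)."""
    if not rubrics:
        return 0.0
    return sum(judge(report, item) for item in rubrics) / len(rubrics)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one rollout group (population std is an assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keyword_judge(report: str, item: EvidenceNode) -> bool:
    """Toy stand-in for an LLM judge: checks that the evidence string appears."""
    return item.evidence in report


# Toy tree with placeholder content: one internal sub-question and three leaves.
root = EvidenceNode("toy seed topic", children=[
    EvidenceNode("sub-question A", children=[
        EvidenceNode("leaf A1", evidence="alpha"),
        EvidenceNode("leaf A2", evidence="beta"),
    ]),
    EvidenceNode("leaf B", evidence="gamma"),
])

rubrics = leaves(root)
reports = ["alpha beta gamma", "alpha only", "nothing relevant"]
rewards = [rubric_reward(r, rubrics, keyword_judge) for r in reports]
advantages = group_relative_advantages(rewards)
```

In this toy group, the report covering all three leaves receives the highest reward and a positive advantage, while the report covering none receives a negative one. In a real setting the keyword check would be replaced by a model-based judge, and the rollouts would come from the policy being trained.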