Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
2026-05-08 • Artificial Intelligence
AI summary
The authors propose decomposing the reward into multiple explicit criteria and using a large language model (LLM) judge to score each one, which yields richer feedback than a single overall score. They call this approach rubric-grounded reinforcement learning: the policy learns from a multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding information the policy itself never sees. They instantiate the method on a corpus of scientific documents and train with the GRPO algorithm, which improves the model's performance on held-out rubric evaluation and on reasoning benchmarks outside the training corpus. This suggests that structured, rubric-based rewards can help models learn more effectively and transfer that learning to new problems.
reinforcement learning, large language model, rubric-grounded RL, policy optimization, GRPO, multi-criterion reward, held-out evaluation, transfer learning, scientific document corpus, reasoning benchmarks
Authors
Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley
Abstract
We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emph{rubric-grounded reinforcement learning (RL)}: a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves $71.7\%$ normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus -- GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
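As a concrete illustration of the weighted, multi-criterion reward the abstract describes, here is a minimal sketch. The criterion names, weights, and scores are hypothetical, and the aggregation (a weight-normalized sum, so the reward falls in [0, 1]) is one plausible reading of the paper's "normalized reward", not its published implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # relative importance of this criterion in the rubric
    score: float   # judge-assigned score for this criterion, assumed in [0, 1]

def rubric_reward(criteria: list[Criterion]) -> float:
    """Aggregate per-criterion judge scores into one scalar reward.

    Dividing by the total weight normalizes the reward to [0, 1]
    regardless of the weight scale.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        return 0.0
    return sum(c.weight * c.score for c in criteria) / total_weight

# Hypothetical rubric for a scientific-document task; in the paper's setup
# a frozen LLM judge, conditioned on grounding the policy never sees,
# would fill in each score.
rubric = [
    Criterion("factual_accuracy", weight=3.0, score=0.8),
    Criterion("covers_key_findings", weight=2.0, score=1.0),
    Criterion("cites_methodology", weight=1.0, score=0.5),
]
print(rubric_reward(rubric))  # ~0.817
```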
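The paper trains with Group Relative Policy Optimization (GRPO). As background on that algorithm rather than the authors' training code, the sketch below shows the group-relative advantage estimate GRPO is known for: each sampled response's reward is standardized against the group of responses drawn for the same prompt, in place of a learned value baseline.

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: standardize each response's reward
    against the mean and standard deviation of its prompt group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four sampled responses to one prompt, each scored by the rubric judge;
# above-average responses get positive advantages, below-average negative.
print(grpo_advantages([0.82, 0.40, 0.75, 0.55]))
```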