Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning
2026-05-08 • Artificial Intelligence
AI summary
The authors propose decomposing the reward into multiple explicit criteria and using a large language model (LLM) judge to score each one, which yields richer feedback than a single overall score. They call this approach rubric-grounded reinforcement learning: the policy learns from a multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding information the policy itself never sees. They instantiate the method on a corpus of scientific documents and train with the GRPO algorithm, which improves the model's performance on held-out rubric evaluation and on reasoning benchmarks outside the training corpus. This suggests that structured, rubric-based rewards can help models learn more effectively and transfer that learning to new problems.
reinforcement learning, large language model, rubric-grounded RL, policy optimization, GRPO, multi-criterion reward, held-out evaluation, transfer learning, scientific document corpus, reasoning benchmarks
Authors
Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley
Abstract
We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize \emph{rubric-grounded reinforcement learning (RL)}: a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves $71.7\%$ normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus -- GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
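As a concrete illustration of the weighted, multi-criterion reward the abstract describes, here is a minimal sketch. The criterion names, weights, and scores are hypothetical, and the aggregation (a weight-normalized sum, so the reward falls in [0, 1]) is one plausible reading of the paper's "normalized reward", not its published implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: float  # relative importance of this criterion in the rubric
    score: float   # judge-assigned score for this criterion, assumed in [0, 1]

def rubric_reward(criteria: list[Criterion]) -> float:
    """Aggregate per-criterion judge scores into one scalar reward.

    Dividing by the total weight normalizes the reward to [0, 1]
    regardless of the weight scale.
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        return 0.0
    return sum(c.weight * c.score for c in criteria) / total_weight

# Hypothetical rubric for a scientific-document task; in the paper's setup
# a frozen LLM judge, conditioned on grounding the policy never sees,
# would fill in each score.
rubric = [
    Criterion("factual_accuracy", weight=3.0, score=0.8),
    Criterion("covers_key_findings", weight=2.0, score=1.0),
    Criterion("cites_methodology", weight=1.0, score=0.5),
]
print(rubric_reward(rubric))  # ~0.817
```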
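The paper trains with Group Relative Policy Optimization (GRPO). As background on that algorithm rather than the authors' training code, the sketch below shows the group-relative advantage estimate GRPO is known for: each sampled response's reward is standardized against the group of responses drawn for the same prompt, in place of a learned value baseline.

```python
import statistics

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: standardize each response's reward
    against the mean and standard deviation of its prompt group."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four sampled responses to one prompt, each scored by the rubric judge;
# above-average responses get positive advantages, below-average negative.
print(grpo_advantages([0.82, 0.40, 0.75, 0.55]))
```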