QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards

2026-06-02 • Computation and Language

Computation and Language, Artificial Intelligence

AI summary

The authors study rubric-based reinforcement learning, in which rubrics (detailed grading guidelines) are used to judge answers, and find that improving the rubrics while keeping the questions fixed does not work well. They propose QUBRIC, a method that designs questions and rubrics together to create clearer, testable tasks. The system rewrites vague questions into specific scenarios and filters out examples that do not help learning. This approach improves ArenaHard performance by 5.5 points over the SFT baseline and transfers to held-out benchmarks in legal, moral, and narrative reasoning. The work provides evidence that co-designing questions and rubrics can make learning from complex feedback more practical.

reinforcement learning, rubric-based reinforcement learning, query distribution, scenario-based questions, contrastive rubric generation, teacher-policy gaps, GRPO training, instruction-following data, transfer learning, reasoning tasks

Authors

Rongzhi Zhang, Rui Feng, Zhihan Zhang, Jingfeng Yang, Qingyu Yin, Xin Liu, Zixuan Zhang, Priyanka Nigam, Bing Yin, Tuo Zhao, Chao Zhang

Abstract

Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.
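The point about vanishing reward signal can be illustrated with a minimal sketch. GRPO normalizes each response's reward against the other responses sampled for the same query, so a group in which every response fails (or every response passes) yields zero advantage for all of them. The abstract does not state the criterion QUBRIC uses for learnability filtering; the reward-spread threshold below, along with the names `group_advantages`, `is_learnable`, `filter_pairs`, and `min_std`, is a hypothetical stand-in chosen only for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_learnable(rewards, min_std=0.05):
    """Assumed criterion (not from the paper): keep a query-rubric pair
    only if the sampled responses differ enough in rubric reward."""
    return pstdev(rewards) >= min_std


def filter_pairs(pairs, min_std=0.05):
    """Drop pairs whose reward group carries no usable training signal."""
    return [p for p in pairs if is_learnable(p["rewards"], min_std)]


# Illustrative rubric rewards for four sampled responses per query.
pairs = [
    {"query": "q_all_fail", "rewards": [0.0, 0.0, 0.0, 0.0]},
    {"query": "q_all_pass", "rewards": [1.0, 1.0, 1.0, 1.0]},
    {"query": "q_mixed", "rewards": [0.2, 0.8, 0.5, 0.9]},
]

kept = filter_pairs(pairs)
print([p["query"] for p in kept])
print(group_advantages(pairs[0]["rewards"]))
```

In this sketch only the mixed-outcome query survives filtering, and the all-fail group produces advantages of zero for every response, which is the no-signal case the abstract describes for queries with unverifiable fabricated references.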
