SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
2026-04-09 • Artificial Intelligence
Artificial Intelligence, Machine Learning
AI summary
The authors aim to improve the reasoning of language models using Reinforcement Learning with Verifiable Rewards (RLVR). They argue that existing RLVR training data does not span general reasoning skills such as causal inference and temporal understanding. To address this, they created SUPERNOVA, a framework for selecting and mixing expert-annotated instruction-tuning data for RLVR. Their experiments show that the choice of training tasks has a significant effect on results, and that models trained with SUPERNOVA outperform strong baselines on challenging reasoning benchmarks. The work offers practical guidance for building training sets that extend RLVR to general reasoning.
Reinforcement Learning, Verifiable Rewards, Large Language Models, General Reasoning, Instruction Tuning, Data Curation, Causal Inference, Synthetic Interventions, Benchmarking, Task Selection
Authors
Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
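The contrast the abstract draws between selecting source tasks by overall average performance and selecting them per target task can be illustrated with a small sketch. This is not the authors' implementation: the score table, the task names, and both function names are hypothetical, and the sketch assumes that each source task has already been scored on each target benchmark (for example, after a separate RL run per source task).

```python
from typing import Dict, List

# Hypothetical layout: scores[source_task][target_task] = downstream score
# obtained after training on that source task alone. All values are invented
# purely for illustration and are not results from the paper.
Scores = Dict[str, Dict[str, float]]


def select_by_average(scores: Scores, k: int) -> List[str]:
    """Return the k source tasks with the highest mean score over all targets."""
    def mean_score(source: str) -> float:
        values = list(scores[source].values())
        return sum(values) / len(values)

    return sorted(scores, key=mean_score, reverse=True)[:k]


def select_per_target(scores: Scores, k: int) -> List[str]:
    """For each target, take its k best source tasks; return their union
    in first-seen order (assumes every source is scored on the same targets)."""
    targets = list(next(iter(scores.values())))
    selected: List[str] = []
    for target in targets:
        ranked = sorted(scores, key=lambda s: scores[s][target], reverse=True)
        for source in ranked[:k]:
            if source not in selected:
                selected.append(source)
    return selected


# Toy example: src_c is best on average, but never best on any single target.
EXAMPLE: Scores = {
    "src_a": {"causal": 0.60, "temporal": 0.20},
    "src_b": {"causal": 0.20, "temporal": 0.60},
    "src_c": {"causal": 0.45, "temporal": 0.45},
}

if __name__ == "__main__":
    print(select_by_average(EXAMPLE, k=1))
    print(select_per_target(EXAMPLE, k=1))
```

In this toy table, average-based selection keeps only the all-round source task, while per-target selection keeps the two specialists. The abstract reports that the per-target strategy performs better in the authors' experiments; the sketch only shows how the two strategies can disagree, not why one wins.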