Learning What to Learn: Stage-Specific Data Sets for SFT-then-RL in Small Language Model Reasoning

2026-06-03
Computation and Language

AI summary

The authors study how to post-train small language models (SLMs) for reasoning with a two-stage pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (RL). They argue that each stage should see different data: SFT is better suited to hard problems whose reasoning skills the model has not yet mastered, while RL is better suited to consolidating skills the model can already partially access. For hard SFT samples, a Bridge mechanism transforms raw teacher-generated reasoning traces into supervision that an SLM can learn from more easily. For hard samples that remain unsolved during RL, Critique Fine-Tuning converts the all-zero-reward failures into diagnostic, repair, and new reasoning-trace supervision for the next SFT stage. In experiments on two SLMs across five reasoning benchmarks, the approach consistently improves over representative SFT, distillation, and RL baselines. Overall, the authors emphasize that coordinating data difficulty across SFT and RL improves reasoning in SLMs.

Small Language Models, Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), Reasoning Benchmarks, Training Data Difficulty, Bridge Mechanism, Critique Fine-Tuning, Knowledge Distillation, Reward Signals, Post-Training
Authors
Chongyang He, Rui Zhang, Zixuan Wang, Xin Li
Abstract
Post-training Small Language Models (SLMs) for reasoning typically follows an SFT-then-RL pipeline, yet existing work rarely considers what data should be learned at each stage. We argue that data strategy should be aligned with the distinct roles of SFT and RL: SFT is better suited for acquiring not-yet-mastered reasoning skills, while RL is better suited for consolidating skills that the model can already partially access. Based on this principle, we propose a difficulty-aware SFT-then-RL framework that organizes training data into stage-specific sets. For hard samples in the SFT stage, we introduce a Bridge mechanism that transforms raw teacher-generated reasoning traces into more learnable supervision for SLMs. For hard samples that remain unsolved during RL, we apply Critique Fine-Tuning by converting all-zero-reward failures into diagnostic, repair, and new reasoning trace supervision for the next SFT stage. Experiments on two SLMs across five reasoning benchmarks show that our method consistently improves over representative SFT, distillation, and RL baselines. Our results highlight the importance of coordinating data difficulty across SFT and RL for effective SLM reasoning post-training.
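
The stage-specific split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how difficulty is measured, so the sketch assumes it is estimated from the model's pass rate over a few sampled attempts. The names `Sample`, `pass_rate`, `split_by_stage`, and `all_zero_reward`, the parameter `k`, and the zero-pass-rate cutoff are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    prompt: str
    answer: str  # reference final answer


def pass_rate(sample: Sample, solve: Callable[[str], str], k: int = 8) -> float:
    """Fraction of k attempts whose answer matches the reference.

    `solve` stands in for sampling one answer from the model (assumption).
    """
    hits = sum(solve(sample.prompt) == sample.answer for _ in range(k))
    return hits / k


def split_by_stage(
    samples: List[Sample], solve: Callable[[str], str], k: int = 8
) -> Dict[str, List[Sample]]:
    """Route each sample to a stage by estimated difficulty.

    Assumed rule: never solved -> SFT (skill not yet mastered);
    solved at least once -> RL (skill partially accessible).
    """
    buckets: Dict[str, List[Sample]] = {"sft": [], "rl": []}
    for s in samples:
        stage = "sft" if pass_rate(s, solve, k) == 0.0 else "rl"
        buckets[stage].append(s)
    return buckets


def all_zero_reward(rewards: List[float]) -> bool:
    """True when every rollout in a non-empty group received zero reward.

    Such groups are the candidates the abstract sends to Critique Fine-Tuning.
    """
    return len(rewards) > 0 and all(r == 0 for r in rewards)


if __name__ == "__main__":
    # Toy stand-in for a model: it only knows one answer.
    toy_solve = lambda prompt: "4" if prompt == "2+2" else "?"
    data = [Sample("2+2", "4"), Sample("hard problem", "42")]
    stages = split_by_stage(data, toy_solve, k=4)
    print([s.prompt for s in stages["sft"]], [s.prompt for s in stages["rl"]])
```

In practice the cutoff between the two sets would be a tunable design choice; the paper itself should be consulted for the actual difficulty criterion and for how the Bridge and critique supervision are constructed.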