Learning Process Rewards via Success Visitation Matching for Efficient RL

2026-06-22, Machine Learning

Machine Learning, Artificial Intelligence, Robotics
AI summary

The authors address the problem of learning from very rare rewards in reinforcement learning, where feedback is only given when a task is completed. They propose a method that trains a discriminator to tell apart successful and unsuccessful attempts, then uses this to create a more continuous, helpful reward signal based on how closely the policy's behavior matches successful attempts. This gives the learning process better guidance at every step without changing what the best solution is. Their experiments on robotic tasks show that this method speeds up learning compared to using only the sparse success reward.

reinforcement learning, sparse rewards, reward shaping, credit assignment, discriminator, policy finetuning, robotic control, state-action visitation, manipulation tasks
Authors
Raymond Tsao, Andrew Wagenmaker, Sergey Levine
Abstract
In many modern applications of reinforcement learning (RL), the natural reward for a task of interest is inherently sparse: the reward is 0 everywhere except at task completion, where it is +1. Training a policy to maximize such a sparse reward requires solving a challenging credit assignment problem, leading to slow or ineffective RL improvement. We propose a simple approach to transform a sparse outcome reward into a dense process reward. Our approach relies on training a discriminator to distinguish between previous successful and unsuccessful episodes, and using this discriminator to incentivize the RL-learned policy to match the state-action visitations of successful episodes while avoiding those of unsuccessful episodes. By incentivizing the policy to match the visitations over all states, not just those that correspond to task success, this reward provides dense feedback on whether progress is being made towards task completion, and we show that it provably does so without changing the optimal policy. Focusing on finetuning of robotic control policies, we demonstrate that our approach leads to significantly faster RL finetuning on both simulated and real-world manipulation tasks, as compared to simply maximizing the sparse outcome reward.
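The discriminator-based reward described in the abstract can be illustrated with a toy sketch. The abstract does not give the discriminator architecture or the exact reward formula, so everything below is an assumption made for illustration: a logistic-regression discriminator over hypothetical 2-D state-action features, and the discriminator logit, log D - log(1 - D), used as the dense per-step reward. The function names `train_discriminator` and `process_reward` are hypothetical and do not come from the paper.

```python
import numpy as np

def train_discriminator(success_sa, failure_sa, lr=0.5, steps=500):
    """Fit D(s, a) = sigmoid(w . x + b) by gradient descent on the logistic loss.

    State-action pairs from successful episodes are labeled 1 and pairs from
    unsuccessful episodes are labeled 0. (Assumed model, not the paper's.)
    """
    x = np.vstack([success_sa, failure_sa])
    y = np.concatenate([np.ones(len(success_sa)), np.zeros(len(failure_sa))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(success | s, a)
        grad = p - y                            # gradient of the loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def process_reward(sa, w, b):
    """Assumed dense reward: the discriminator logit, log D - log(1 - D)."""
    return sa @ w + b

# Toy data: hypothetical 2-D state-action features. Successful episodes
# cluster around (1, 1) and unsuccessful ones around (-1, -1).
rng = np.random.default_rng(0)
success_sa = rng.normal(loc=1.0, scale=0.5, size=(200, 2))
failure_sa = rng.normal(loc=-1.0, scale=0.5, size=(200, 2))

w, b = train_discriminator(success_sa, failure_sa)
r_good = process_reward(np.array([1.0, 1.0]), w, b)    # resembles successful visitations
r_bad = process_reward(np.array([-1.0, -1.0]), w, b)   # resembles unsuccessful visitations
```

In this sketch a state-action pair that resembles those visited in successful episodes receives a positive reward at every step, and one that resembles unsuccessful episodes receives a negative reward, which is the kind of dense feedback the abstract describes. The sketch does not reproduce the paper's guarantee that the optimal policy is unchanged; that depends on the actual reward construction, which is not stated here.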