Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes

2026-06-15 Robotics

Robotics, Machine Learning
AI summary

The authors address a problem in fine-tuning robot policies with online reinforcement learning: feedback arrives only as success or failure for a whole attempt, but the policy update needs a learning signal for every individual step. They note that collapsing this outcome into one reward signal mixes two goals: completing the task at all (viability) and completing it efficiently. To separate them, they propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which learns a separate value estimate for each goal, blends the two depending on how likely success currently is, and assigns outcome labels only to the parts of an attempt that the current policy executed itself, not to segments where an intervention took over. On three contact-rich bimanual robot tasks, this raises success rates from 36%, 44%, and 12% with supervised fine-tuning to 92%, 88%, and 38%.

pretrained policies, online reinforcement learning, binary outcome, actor update, credit assignment, advantage-weighted behavior cloning, hierarchical learning, intervention-aware, bimanual tasks, robot fine-tuning
Authors
Tongyan Fang, Siyuan Huang, Naiyu Fang, Ganlong Zhao, Zhongjin Luo, Jianbo Liu, Xiaogang Wang, Ying Dong, Hongsheng Li
Abstract
When pretrained vision-language-action (VLA) policies are fine-tuned through online RL, each rollout episode produces only a single binary outcome (success or failure), yet the actor update requires per-transition supervision. Existing approaches commonly reduce this sparse outcome to a single scalar reward or advantage signal, which raises two problems. First, a single scalar signal conflates the two objectives of viability and efficiency; once basic success is achieved, the binary label provides no gradient to distinguish efficient completions from slow ones. Second, real-world rollouts mix autonomous and intervention segments; naively assigning episode outcomes across these boundaries introduces incorrect credit assignment. To address these issues, we propose Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for the two objectives on different data subsets. A state-adaptive gate $g_t$ merges their one-step advantages, prioritizing viability when success is uncertain and shifting to efficiency only when viability is high; the merged advantage is then converted into per-transition weights on the actor loss. Intervention-aware credit assignment further restricts outcome labels to segments executed by the current policy, preventing supervision from leaking across intervention boundaries. In real-robot experiments on three contact-rich bimanual tasks, HABC raises success from supervised fine-tuning (SFT) baselines of 36%, 44%, and 12% to 92%, 88%, and 38%.
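The abstract gives no formulas, so the following is only a rough sketch of how gated advantage weighting of this kind might look, not the authors' implementation. Everything specific here is an assumption made for illustration: the sigmoid form of the gate, the reading of $g_t$ as the weight placed on the efficiency advantage, the exponential (AWR-style) weight with clipping, and all function names and hyperparameters (`threshold`, `sharpness`, `beta`, `max_weight`).

```python
import numpy as np

def gate(p_success, threshold=0.8, sharpness=10.0):
    """Hypothetical gate g_t in (0, 1): near 0 while predicted success is
    uncertain, near 1 once it is high. The sigmoid form is an assumption."""
    p = np.asarray(p_success, dtype=float)
    return 1.0 / (1.0 + np.exp(-sharpness * (p - threshold)))

def transition_weights(adv_viability, adv_efficiency, p_success,
                       beta=1.0, max_weight=20.0):
    """Blend the two one-step advantages with the gate, then map the result
    to positive per-transition weights (exponential weighting is assumed)."""
    g = gate(p_success)
    combined = ((1.0 - g) * np.asarray(adv_viability, dtype=float)
                + g * np.asarray(adv_efficiency, dtype=float))
    return np.minimum(np.exp(combined / beta), max_weight)

def outcome_labels(success, is_intervention):
    """Give the episode outcome only to transitions executed by the policy;
    intervention transitions get NaN, meaning 'no outcome label' (assumed)."""
    mask = np.asarray(is_intervention, dtype=bool)
    labels = np.full(mask.shape, 1.0 if success else 0.0)
    labels[mask] = np.nan
    return labels

def weighted_bc_loss(nll, weights):
    """Advantage-weighted behavior cloning: the mean of each transition's
    negative log-likelihood scaled by its weight."""
    return float(np.mean(np.asarray(weights, dtype=float)
                         * np.asarray(nll, dtype=float)))

# Toy usage: transition 0 is in a state where success is unlikely,
# transition 1 in a state where success is nearly assured.
w = transition_weights(adv_viability=[1.0, 0.0],
                       adv_efficiency=[0.0, 1.0],
                       p_success=[0.1, 0.95])
loss = weighted_bc_loss(nll=[0.7, 0.4], weights=w)
labels = outcome_labels(success=True, is_intervention=[False, True, False])
```

In this sketch the viability advantage dominates the weight of the first transition and the efficiency advantage dominates the second, which mirrors the prioritization described in the abstract. The real gate, critic targets, and handling of intervention segments may differ.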