Preference-Calibrated Human-in-the-Loop Reinforcement Learning for Robotic Manipulation

2026-06-02 • Robotics

AI summary

The authors address a problem in training robots with human help: even attempts that end in success can contain suboptimal robot actions that forced a human to step in, yet standard training gives those actions the same credit as the good ones. They create a method called PACT that identifies which segments of the robot's behavior were suboptimal and reduces the credit assigned to them by comparing the human's correction with the action the robot would have chosen itself. It also pulls the robot's policy directly toward the human's corrective actions. Across five real-robot manipulation tasks, this more precise use of intervention signals helps the robot learn faster and reach a higher success rate.

Human-in-the-loop reinforcement learning, credit assignment, actor-critic methods, robot manipulation, Q-values, Bellman equation, policy alignment, sample efficiency, suboptimal actions, preference calibration

Authors

Zeyi Liu, Guangyao Liu, Yinuo Qu, Yuquan Xue, Bofang Jia, Chunhua Yang, Weihua Gui, Keke Huang, Ziwei Wang

Abstract

Human-in-the-loop reinforcement learning (HIL-RL) improves sample efficiency in real-robot manipulation through online human intervention. However, successful trajectories may include suboptimal actions that deviate from the desired task-execution path and force human intervention. Existing HIL-RL methods typically apply the same credit assignment principle to all transitions: they propagate discounted terminal rewards uniformly through suboptimal segments and ignore the actual contribution of each transition to task success. This overestimates Q-values during critic learning and indirectly steers actor updates toward suboptimal behavior patterns. To address this, we propose PACT, a Preference-calibrated Actor-Critic Training framework that leverages the implicit preference signals induced by intervention to perform credit reassignment on identified suboptimal segments while directly guiding policy training for unbiased critic-actor learning. Specifically, we first design a progress model that learns from human demonstrations and identifies suboptimal segments for credit correction. Then, from the human action and the resampled policy action at the intervention state, we build preference pairs to define a counterfactual advantage that penalizes the Bellman targets of the identified suboptimal segment, enabling directional credit calibration. Moreover, we directly align the policy with human corrective actions in the bounded mean space, providing an additional signal beyond critic-guided updates. Across five real-robot manipulation tasks, PACT improves the average success rate by 24.5% and achieves 1.3 times faster convergence, thereby improving both RL sample efficiency and performance. Code is available at https://anonymous.4open.science/r/HILRL-A1X-BC05.
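
To make the credit-calibration idea concrete, the following is a minimal Python sketch and not the authors' implementation. The abstract does not give formulas, so everything here is an assumption: the counterfactual advantage is taken to be the non-negative gap between the critic's value of the human action and of the resampled policy action, the penalty is subtracted from a standard one-step Bellman target on transitions flagged as suboptimal, and "bounded mean space" is read as the tanh-squashed mean of the policy. The function names, `penalty_weight`, and the `suboptimal` mask (standing in for the output of the progress model) are all hypothetical.

```python
import math
from typing import List, Sequence


def counterfactual_advantage(q_human: float, q_policy: float) -> float:
    """Assumed form: how much better the critic rates the human action than the
    resampled policy action at the intervention state, clipped at zero."""
    return max(0.0, q_human - q_policy)


def calibrated_targets(
    rewards: Sequence[float],
    next_q: Sequence[float],
    dones: Sequence[bool],
    suboptimal: Sequence[bool],  # hypothetical mask from a progress model
    advantage: float,
    gamma: float = 0.99,
    penalty_weight: float = 1.0,  # hypothetical hyperparameter
) -> List[float]:
    """One-step Bellman targets, lowered on transitions flagged as suboptimal."""
    targets = []
    for r, q_next, done, flagged in zip(rewards, next_q, dones, suboptimal):
        # Standard target: reward plus discounted bootstrap (no bootstrap at terminal).
        target = r + (0.0 if done else gamma * q_next)
        if flagged:
            # Assumed calibration: subtract the weighted counterfactual advantage.
            target -= penalty_weight * advantage
        targets.append(target)
    return targets


def bounded_mean_alignment_loss(
    pre_tanh_mean: Sequence[float], human_action: Sequence[float]
) -> float:
    """Assumed alignment term: mean squared error between the tanh-squashed
    policy mean and the human corrective action (both in [-1, 1])."""
    bounded = [math.tanh(m) for m in pre_tanh_mean]
    return sum((b - a) ** 2 for b, a in zip(bounded, human_action)) / len(bounded)


if __name__ == "__main__":
    adv = counterfactual_advantage(q_human=1.0, q_policy=0.4)
    targets = calibrated_targets(
        rewards=[0.0, 0.0, 1.0],
        next_q=[0.5, 0.8, 0.0],
        dones=[False, False, True],
        suboptimal=[False, True, False],
        advantage=adv,
        gamma=0.9,
    )
    print(targets)
```

In this sketch only the second transition is flagged, so only its target is reduced; the unflagged transitions keep the ordinary discounted target. A real system would compute these quantities on batched tensors inside the critic and actor losses.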
