When Life Gives You BC, Make Q-functions: Extracting Q-values from Behavior Cloning for On-Robot Reinforcement Learning

2026-05-06

Robotics · Artificial Intelligence
AI summary

The authors present Q2RL, a method that helps robots improve beyond their initial demonstrations by combining behavior cloning (BC) with reinforcement learning (RL). First, they estimate a Q-value function from the robot's initial BC policy using a limited number of environment interactions; then a gating mechanism uses these Q-values to decide at each step whether to execute the BC or the RL action. This avoids overwriting good learned behaviors and improves sample efficiency. Tests on robotic manipulation tasks show that Q2RL learns faster and achieves higher success rates than existing methods, even with only a short period of online training.

Behavior Cloning, Reinforcement Learning, Q-Function, Offline-to-Online Learning, Policy, Distribution Mismatch, Manipulation Tasks, Robotic Learning, Q-Gating, D4RL Benchmark
Authors
Lakshita Dodeja, Ondrej Biza, Shivam Vats, Stephen Hart, Stefanie Tellex, Robin Walters, Karl Schmeckpeper, Thomas Weng
Abstract
Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to overwrite previously learned good actions due to a distribution mismatch between offline data and online learning. In this work, we propose Q2RL (Q-Estimation and Q-Gating from BC for Reinforcement Learning), an algorithm for efficient offline-to-online learning. Our method consists of two parts: (1) Q-Estimation, which extracts a Q-function from a BC policy using a few interaction steps with the environment, followed by online RL with (2) Q-Gating, which switches between BC and RL policy actions based on their respective Q-values to collect samples for training the RL policy. Across manipulation tasks from the D4RL and robomimic benchmarks, Q2RL outperforms state-of-the-art offline-to-online learning baselines on success rate and time to convergence. Q2RL is efficient enough to be applied in an on-robot RL setting, learning robust policies for contact-rich, high-precision manipulation tasks such as pipe assembly and kitting in 1-2 hours of online interaction, achieving success rates of up to 100% and up to a 3.75x improvement over the original BC policy. Code and video are available at https://pages.rai-inst.com/q2rl_website/
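To make the two components concrete, below is a minimal sketch of how Q-Estimation and Q-Gating could fit together, based only on the abstract's description. All names (`QNetwork`, `fit_q_from_bc`, `q_gated_action`, `bc_policy`, `rl_policy`) are illustrative assumptions rather than the authors' actual code, and the SARSA-style TD target is one plausible way to fit a Q-function to the BC policy from a few environment interactions; it is not confirmed by the source.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 32, 7  # hypothetical flat observation/action sizes

class QNetwork(nn.Module):
    """Simple critic Q(s, a) -> scalar."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def fit_q_from_bc(q_fn, bc_policy, transitions, gamma=0.99, lr=3e-4):
    """Q-Estimation sketch: fit Q to the (frozen) BC policy with a
    SARSA-style TD target on a small buffer of (s, a, r, s', done)
    tuples gathered by rolling out the BC policy for a few steps.
    A target network would stabilize this in practice."""
    opt = torch.optim.Adam(q_fn.parameters(), lr=lr)
    for obs, act, rew, next_obs, done in transitions:
        with torch.no_grad():
            next_act = bc_policy(next_obs)  # on-policy w.r.t. the BC policy
            target = rew + gamma * (1.0 - done) * q_fn(next_obs, next_act)
        loss = (q_fn(obs, act) - target).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def q_gated_action(obs, bc_policy, rl_policy, q_fn):
    """Q-Gating sketch: execute whichever policy's action the estimated
    Q-function scores higher; the resulting transition goes into the
    RL policy's training buffer. Assumes a batch-of-one observation."""
    with torch.no_grad():
        a_bc, a_rl = bc_policy(obs), rl_policy(obs)
        return a_rl if q_fn(obs, a_rl) >= q_fn(obs, a_bc) else a_bc
```

Note the design implied by the abstract: the gate never discards the BC action outright, it only defers to the RL policy when the critic prefers its action, which is what lets the method avoid overwriting good demonstrated behavior during early online training.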