GPU-Parallel Multi-Task Reinforcement Learning with Demonstration Guided Policy Optimization
2026-06-02 • Robotics
AI summary
The authors created a new way to train multiple robot tasks at once using powerful GPUs, instead of training one task at a time. They built a benchmark called MT-Libero that combines various robot manipulation tasks to train efficiently with different inputs and physics settings. To improve learning when successful examples are rare, they developed a method called DGPO that uses demonstrations to guide the training process while still learning on its own. This approach helps the system learn better and more stably than previous methods. Overall, their work helps train robots to do many different tasks more effectively.
reinforcement learning, GPU parallelism, multi-task learning, robot manipulation, demonstration learning, on-policy learning, PPO (Proximal Policy Optimization), behavior cloning, physics randomization, Isaac Lab
Authors
Rui Zhang, Qiwei Wu, Zhengyu Zhang, Tao Li, Yunrong Guo, Junjie Lai, Renjing Xu, Weihua Zhang
Abstract
Large-scale GPU-parallel reinforcement learning has changed what can be trained in robot simulation, yet most systems still optimize one specialist policy per task. We propose a construction methodology for turning structured manipulation task families into GPU-parallel multi-task RL benchmarks, and instantiate it as MT-Libero using LIBERO assets and task predicates in Isaac Lab. The resulting benchmark supports simultaneous reinforcement learning over heterogeneous task suites with parallel rendering, physics randomization, and state-input or visual-input policies. To make such training practical under sparse success signals and limited prior data, we further propose DGPO, an on-policy demonstration-guided method that combines importance-weighted PPO with adaptive behavior cloning on matched demonstration actions. DGPO enables a tunable preference toward demonstrated task distributions, outperforming both prior-free RL and existing demonstration-based methods while preserving the stability and online improvement benefits of on-policy PPO.
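The abstract does not give DGPO's objective, so the following is only a minimal sketch of the general idea it describes: a PPO surrogate computed with importance weights, plus a behavior cloning term on matched demonstration actions whose weight adapts during training. The standard clipped PPO surrogate is used here as a stand-in, and the function names, the mean-squared-error cloning loss, and the success-rate-based weighting schedule are all illustrative assumptions, not the authors' formulation.

```python
import math


def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard clipped PPO surrogate, negated so that lower is better."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance weight pi_new(a|s) / pi_old(a|s)
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


def bc_loss(policy_actions, demo_actions):
    """Mean squared error between policy actions and matched demo actions.

    Assumption: each policy action is already paired with one demo action.
    """
    total, count = 0.0, 0
    for pa, da in zip(policy_actions, demo_actions):
        for p, d in zip(pa, da):
            total += (p - d) ** 2
            count += 1
    return total / count


def adaptive_bc_weight(success_rate, max_weight=1.0):
    """Hypothetical schedule: rely on demos more when online success is rare."""
    return max_weight * (1.0 - success_rate)


def combined_loss(logp_new, logp_old, advantages,
                  policy_actions, demo_actions, success_rate):
    """PPO surrogate plus adaptively weighted behavior cloning (a sketch)."""
    weight = adaptive_bc_weight(success_rate)
    return (ppo_clip_loss(logp_new, logp_old, advantages)
            + weight * bc_loss(policy_actions, demo_actions))
```

In this sketch the cloning term fades as the online success rate rises, so the update approaches plain PPO once the policy succeeds on its own. The `max_weight` parameter is one hypothetical way to expose a tunable preference toward the demonstrations.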