Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning

2026-04-10

Machine Learning
AI summary

The authors study how to train an agent to reach goals using only previously collected, reward-free offline data. They improve on existing hierarchical methods by introducing the goal-conditioned mean flow policy, which learns an average velocity field at both the high and low levels of the hierarchy and generates actions in a single sampling step. They also add a LeJEPA loss that pushes goal representation embeddings apart, making the representations more discriminative. The approach performs strongly on both state-based and pixel-based tasks in the OGBench benchmark.

Offline Reinforcement Learning, Goal-Conditioned Policies, Hierarchical Control, Gaussian Policies, Mean Flow Policy, Velocity Field, Representation Learning, LeJEPA Loss, OGBench Benchmark, Long-Horizon Control
Authors
Zhiqiang Dong, Teng Pang, Rongjian Xu, Guoqiang Wu
Abstract
Offline goal-conditioned reinforcement learning (GCRL) is a practical reinforcement learning paradigm that aims to learn goal-conditioned policies from reward-free offline data. Despite recent advances in hierarchical architectures such as HIQL, long-horizon control in offline GCRL remains challenging due to the limited expressiveness of Gaussian policies and the inability of high-level policies to generate effective subgoals. To address these limitations, we propose the goal-conditioned mean flow policy, which introduces an average velocity field into hierarchical policy modeling for offline GCRL. Specifically, the mean flow policy captures complex target distributions for both high-level and low-level policies through a learned average velocity field, enabling efficient action generation via one-step sampling. Furthermore, to address insufficient goal representations, we introduce a LeJEPA loss that repels goal representation embeddings from one another during training, thereby encouraging more discriminative representations and improving generalization. Experimental results show that our method achieves strong performance across both state-based and pixel-based tasks in the OGBench benchmark.
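
To make the one-step sampling idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common mean flow convention in which t = 1 is Gaussian noise, t = 0 is the sample, and an average velocity u moves a point via z_r = z_t - (t - r) * u(z_t, r, t); the abstract does not state which convention the paper uses. The learned networks are replaced by a hypothetical closed-form velocity for a point-mass target, and all function names (`make_point_mass_velocity`, `sample_one_step`, `hierarchical_act`) and both toy targets are illustrative assumptions.

```python
import numpy as np


def make_point_mass_velocity(target_fn):
    """Build a toy average-velocity field (assumption, stands in for a network).

    If the target distribution is a single point a = target_fn(state, cond) and
    the path is z_t = (1 - t) * a + t * noise, trajectories are straight lines,
    so the average velocity over any interval [r, t] is (z_t - a) / t.
    """
    def mean_velocity(z, r, t, state, cond):
        return (z - target_fn(state, cond)) / t
    return mean_velocity


def sample_one_step(mean_velocity, state, cond, rng):
    """Draw noise at t = 1 and jump to t = 0 with one velocity evaluation."""
    noise = rng.standard_normal(state.shape)
    r, t = 0.0, 1.0
    return noise - (t - r) * mean_velocity(noise, r, t, state, cond)


# Hypothetical targets: the high level aims at the midpoint between state and
# goal, the low level takes a bounded step toward the subgoal.
high_level_u = make_point_mass_velocity(lambda s, g: s + 0.5 * (g - s))
low_level_u = make_point_mass_velocity(lambda s, sg: np.tanh(sg - s))


def hierarchical_act(state, goal, rng):
    """High level samples a subgoal, low level samples an action toward it."""
    subgoal = sample_one_step(high_level_u, state, goal, rng)
    action = sample_one_step(low_level_u, state, subgoal, rng)
    return subgoal, action


rng = np.random.default_rng(0)
state = np.zeros(2)
goal = np.array([2.0, -4.0])
subgoal, action = hierarchical_act(state, goal, rng)
print(subgoal, action)
```

In this toy setting the one-step jump recovers the target exactly because the velocity field is known in closed form. With a learned average velocity field the same single evaluation would produce a sample from a richer, possibly multimodal distribution, which is the expressiveness advantage over Gaussian policies that the abstract describes.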