SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
2026-02-19 • Machine Learning
Machine Learning · Artificial Intelligence
AI summary
The authors studied why offline-trained RL agents often get worse when fine-tuned online. They found that the problem comes from 'valleys' in the performance landscape between offline and online solutions, which cause performance drops as gradient updates traverse them. To fix this, they created SMAC, a method that shapes offline training so the transition to online learning is smooth and does not decrease performance. Experiments showed that SMAC improves smoothly during online fine-tuning and reduces regret compared to other methods on standard benchmarks.
offline reinforcement learning · online fine-tuning · actor-critic methods · value function · gradient descent · Score Matched Actor-Critic (SMAC) · Soft Actor-Critic (SAC) · TD3 · D4RL benchmark · performance landscape
Authors
Nathan S. de Lara, Florian Shkurti
Abstract
Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima found by prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning must traverse. Building on this evidence, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths of monotonically increasing reward that first-order optimization can follow. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
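The derivative condition in the abstract has a natural reading under maximum-entropy RL: for a soft-optimal policy with π(a|s) ∝ exp(Q(s,a)/α), the policy score satisfies ∇ₐ log π(a|s) = ∇ₐ Q(s,a)/α. The sketch below is a minimal, hypothetical PyTorch rendering of a regularizer enforcing that kind of equality, not the authors' implementation; `q_net`, `policy.sample`, `policy.log_prob`, and the temperature `alpha` are assumed interfaces, and the exact scaling and loss form SMAC uses may differ.

```python
import torch

def score_match_penalty(q_net, policy, states, alpha=0.2):
    """Hypothetical sketch: penalize mismatch between the policy score
    grad_a log pi(a|s) and the scaled critic action-gradient grad_a Q(s,a)/alpha."""
    # Sample actions from the current policy; these are the points at which
    # the two vector fields are compared (assumed `sample` interface).
    actions = policy.sample(states).detach().requires_grad_(True)

    # Score of the policy: gradient of log pi(a|s) with respect to the action.
    log_pi = policy.log_prob(states, actions).sum()
    policy_score = torch.autograd.grad(log_pi, actions, create_graph=True)[0]

    # Action-gradient of the critic at the same (s, a) pairs.
    q_sum = q_net(states, actions).sum()
    q_grad = torch.autograd.grad(q_sum, actions, create_graph=True)[0]

    # First-order equality (up to the entropy temperature alpha): penalize
    # the squared difference between the two gradients.
    return ((policy_score - q_grad / alpha) ** 2).sum(dim=-1).mean()
```

In an offline actor-critic loop, a term like this would presumably be added, with some weight, to the critic or joint objective alongside the usual Bellman loss, nudging the offline Q-function's action-gradients to agree with the policy's score before online fine-tuning begins.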