Offline-to-Online Learning in Linear Bandits
2026-06-03 • Machine Learning
Machine Learning
AI summaryⓘ
The authors study a learning problem where a model can use past offline data and new online experiences to make decisions. They design an algorithm that starts by relying more on the offline data and gradually explores more as it gathers new information. Their method performs well compared to approaches that use only offline or only online data, getting better over time with more online interactions and more offline samples. Tests show their approach works effectively in different scenarios.
online learningoffline datasetstochastic linear banditregret boundsexploration vs exploitationsublinear regretalgorithmlearning horizon
Authors
Kushagra Chandak, Toshinori Kitamura, Xiaoqi Tan
Abstract
We study online learning with an additional offline dataset in the stochastic linear bandit setting. Although this problem arises frequently in practice, the offline-to-online tradeoff remains poorly understood in structured environments. We propose a linear bandit algorithm that balances this tradeoff: it relies on offline data during early rounds, and increasingly favors exploration as the horizon grows. We establish regret bounds showing that our method is simultaneously competitive with both purely online and purely offline solutions. In particular, it achieves sublinear regret relative to the optimal action in the number of online interactions, while its regret relative to an offline reference decreases as the number of offline samples grows. Empirical results further demonstrate its effectiveness across various problem parameters.