Two-Action Apple Tasting with Switching Costs

2026-06-02Machine Learning

Machine Learning
AI summary

The authors study a problem where a learner repeatedly decides between two actions: one that reveals information but gives no reward, and one that yields a hidden reward but no information. Switching between actions costs the learner, and they measure regret compared to always choosing the best fixed action. Previous algorithms suggested regret grows roughly like T^(2/3), but the authors prove the regret actually grows more slowly, around the square root of T. This finding clarifies the difficulty of this problem and corrects a previous assumption about its complexity.

apple-tasting problemswitching costsregret minimizationoblivious adversaryfeedback graphsminimax regretonline learningrevealing actionblind actionT^(2/3) regret bound
Authors
Tommaso Cesari, Roberto Colomboni
Abstract
We study the two-action apple-tasting problem with switching costs against an oblivious adversary. In an equivalent normalized formulation, at each round the learner chooses between a revealing action and a blind action: the revealing action gives reward $0$ and reveals the hidden value $x_t\in[-1,1]$ of the blind action; the blind action gives reward $x_t$ but reveals nothing. The learner pays one unit whenever they switches actions, and regret is measured against the best fixed action in hindsight. General feedback-graph algorithms with switching costs give $\widetilde O(T^{2/3})$ regret guarantees for this problem. The two-action apple-tasting graph was the natural candidate for the missing $Ω(T^{2/3})$ obstruction in the switching-cost classification: such a lower bound would have transferred to a large family of still-unclassified feedback graphs. We prove that this obstruction is not there: the oblivious minimax expected regret for this problem satisfies \[ \frac{1}{2\sqrt3}\cdot\sqrt T \le R_T^\star \le 2\sqrt{3}\cdot \sqrt{T}. \]