Online Learning and Equilibrium Computation with Ranking Feedback
2026-03-19 • Machine Learning
Machine Learning · Computation and Language · Computer Science and Game Theory
AI summary
The authors study how a learning system can improve its decisions when it only sees rankings of actions rather than exact scores, a situation that arises when working with humans or when privacy is a concern. They show that achieving low regret from rankings alone is generally impossible, whether the rankings reflect immediate results or running averages, unless the environment's feedback is somewhat stable. They then design new algorithms that learn effectively whenever the environment's utilities do not vary too wildly over time. Their methods also let players in a repeated game converge to an approximately stable outcome, and they perform well when routing queries among large language models online.
online learning · regret minimization · ranking feedback · full-information feedback · bandit feedback · Plackett-Luce model · total variation · coarse correlated equilibrium · normal-form games · large language models
Authors
Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang
Abstract
Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on \emph{numeric} utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a \emph{ranking} over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the \emph{instantaneous} utility at the current timestep, and rankings induced by the \emph{time-average} utility up to the current timestep, under both \emph{full-information} and \emph{bandit} feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, \emph{i.e.}, under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
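To make the ranking-feedback mechanism concrete, the sketch below samples a ranking under the Plackett-Luce model referenced in the abstract: positions are filled one at a time, with each remaining action chosen with probability proportional to exp(u_i / temperature). This is a generic illustration of the model, not the authors' algorithm; the function name and numerical-stability shift are our own choices. As the temperature shrinks, the sampled ranking concentrates on the deterministic sort by utility, which is the regime where the abstract's impossibility result for time-average ranking feedback applies.

```python
import math
import random


def plackett_luce_ranking(utilities, temperature=1.0, rng=None):
    """Sample a ranking of actions under the Plackett-Luce model.

    Positions are filled sequentially: among the remaining actions,
    action i is selected with probability proportional to
    exp(u_i / temperature).  Small temperatures make the ranking
    nearly deterministic (sorted by utility).
    """
    rng = rng or random.Random()
    remaining = list(range(len(utilities)))
    ranking = []
    while remaining:
        # Subtract the max remaining utility before exponentiating,
        # so the weights never overflow for small temperatures.
        m = max(utilities[i] for i in remaining)
        weights = [math.exp((utilities[i] - m) / temperature)
                   for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking
```

For example, with utilities [0.1, 0.9, 0.5] and a very small temperature the sampled ranking is essentially always [1, 2, 0], whereas a large temperature makes all orderings nearly equally likely.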