Pluralistic Leaderboards

2026-06-01 · Computer Science and Game Theory

AI summary

The authors point out that the common way to rank large language models (LLMs), fitting a Bradley-Terry model that collapses all user feedback into a single quality score, misrepresents users whose preferences differ fundamentally. To address this, they propose a leaderboard mechanism inspired by social choice theory whose rankings remain stable across a heterogeneous user population. Their approach guarantees local stability: no model outside the top k is collectively preferred to the top-k set by more than an O(1/k) fraction of users. It needs only about k pairwise comparisons per user, up to logarithmic factors. Using data from LMArena, they show that standard Bradley-Terry aggregation can violate local stability in practice, whereas their method provides substantially stronger stability guarantees.

large language models, leaderboard, Bradley-Terry model, pairwise comparison, local stability, social choice theory, pluralistic leaderboard, user preferences, LMArena
Authors
Nika Haghtalab, Ariel D. Procaccia, Han Shao, Serena Lutong Wang, Kunhe Yang
Abstract
Recent leaderboard-based evaluations of large language models aggregate user feedback by fitting a Bradley--Terry model to pairwise comparisons, producing a single global ranking based on a latent quality score. While appealing for its simplicity, this approach is incompatible with heterogeneous preferences: when LLMs are used across diverse tasks and use cases, users who favor fundamentally different model behaviors can be systematically misrepresented when collapsed into a single quality score. To address this issue, we study \emph{pluralistic leaderboards} that aim to remain \emph{stable} with respect to heterogeneous user populations. Drawing on ideas from social choice theory, we adapt the notion of \emph{local stability}, which requires that no model outside the top-$k$ positions is collectively preferred to the top-$k$ set by more than an $O(1/k)$ fraction of users. Building on techniques from the social choice literature, we design an alternative leaderboard mechanism that satisfies local stability while eliciting only $\widetilde{O}(k)$ pairwise comparisons per user, where $k$ is the size of the prefix for which stability is guaranteed. Using data from LMArena, we show that standard Bradley--Terry aggregation can violate local stability in practice, whereas our method provides substantially stronger stability guarantees.
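
To make the local stability condition concrete, the following is a minimal Python sketch of how one might check it on a toy population. It is not the authors' mechanism, and it rests on several assumptions that the abstract does not confirm: each user supplies a full ranking (the paper elicits only $\widetilde{O}(k)$ pairwise comparisons per user), a user "prefers a model to the top-$k$ set" when they rank it above every model in that set, and the constant hidden in $O(1/k)$ is a hypothetical parameter `c`. The names `blocking_fraction` and `worst_violation`, and all of the data, are illustrative.

```python
from typing import List, Sequence, Tuple


def blocking_fraction(
    rankings: Sequence[Sequence[str]], top_k: Sequence[str], challenger: str
) -> float:
    """Fraction of users who rank `challenger` above every model in `top_k`.

    Assumption: this is how "preferred to the top-k set" is read here.
    """
    count = 0
    for ranking in rankings:
        position = {model: i for i, model in enumerate(ranking)}
        if all(position[challenger] < position[m] for m in top_k):
            count += 1
    return count / len(rankings)


def worst_violation(
    rankings: Sequence[Sequence[str]], leaderboard: Sequence[str], k: int
) -> Tuple[str, float]:
    """Return the model outside the top-k prefix with the largest blocking fraction."""
    top_k = list(leaderboard[:k])
    outside = list(leaderboard[k:])
    fractions = {m: blocking_fraction(rankings, top_k, m) for m in outside}
    worst = max(fractions, key=fractions.get)
    return worst, fractions[worst]


# Hypothetical population of 10 users with heterogeneous preferences.
rankings: List[List[str]] = (
    [["A", "B", "C", "D"]] * 2
    + [["B", "A", "C", "D"]] * 2
    + [["D", "C", "B", "A"]] * 4
    + [["D", "A", "B", "C"]] * 2
)
leaderboard = ["A", "B", "C", "D"]  # hypothetical single global ranking
k = 2
c = 1.0  # hypothetical constant standing in for the O(1/k) bound

model, fraction = worst_violation(rankings, leaderboard, k)
print(model, fraction, "violates" if fraction > c / k else "ok")
```

In this toy population, six of ten users rank D above both A and B, so the prefix {A, B} fails the check at threshold `c / k = 0.5`, even though A and B each win many head-to-head comparisons. This is the kind of situation the abstract describes, in which a sizable group of users is misrepresented by a single global ranking.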