Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

2026-05-13

Machine Learning, Artificial Intelligence
AI summary

The authors address challenges in evaluating large AI models safely and reliably, noting that human reviewers often have different opinions that make results inconsistent. They point out that current methods usually collect only a few ratings per item and don't track who rated what, which limits understanding of variation between reviewers. To fix this, the authors propose a method that more realistically simulates reviewer behavior, using large datasets in which each rating is tied to a known reviewer. They study how many items and how many ratings per item are needed to get trustworthy results. Their approach helps improve the repeatability of AI evaluations.

generative AI, large language models, evaluation metrics, human annotation, reproducibility, statistical significance, bootstrapping, inter-rater variability, experimental repeatability, data annotation
Authors
Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan
Abstract
As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionally challenging because very little data exists to study how experimental repeatability actually improves as the annotator pool grows. Standard evaluation practices typically rely on a small number of annotations per item (often 3 to 5) and lack the persistent rater identifiers necessary to model individual variance across items. In this work, we introduce a multi-level bootstrapping approach to realistically model annotator behavior. Leveraging datasets with a large number of ratings and persistent rater identifiers, we analyze the tradeoffs between the number of items ($N$) and the number of responses per item ($K$) required to achieve statistical significance.
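To make the idea of multi-level bootstrapping concrete, here is a minimal sketch of a two-level resampling scheme: items are resampled first, then ratings are resampled within each sampled item. This is an illustration only, not the authors' implementation; the data layout (a dict mapping item IDs to lists of ratings), the mean-score metric, and the function name `multilevel_bootstrap` are assumptions made for the example.

```python
# A minimal sketch (not the paper's implementation) of a two-level bootstrap:
# resample items, then resample per-item ratings, and compare how the width of
# the resulting confidence interval changes as N (items) and K (responses per
# item) vary. All names and the toy data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def multilevel_bootstrap(ratings_by_item, n_items, k_responses, n_boot=2000):
    """Return bootstrap replicates of the mean rating.

    ratings_by_item: dict mapping item id -> list of ratings from individual
                     raters (persistent rater identifiers are what make this
                     within-item resampling meaningful).
    n_items:         number of items (N) drawn with replacement per replicate.
    k_responses:     number of ratings (K) drawn with replacement per item.
    """
    item_ids = list(ratings_by_item.keys())
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        # Level 1: resample items.
        sampled_items = rng.choice(item_ids, size=n_items, replace=True)
        # Level 2: resample ratings within each sampled item.
        item_means = [
            rng.choice(ratings_by_item[i], size=k_responses, replace=True).mean()
            for i in sampled_items
        ]
        replicates[b] = np.mean(item_means)
    return replicates

# Toy data: 50 items, each with 20 ratings on a 1-5 scale.
data = {i: rng.integers(1, 6, size=20).tolist() for i in range(50)}

# Explore the N-vs-K tradeoff by comparing 95% confidence-interval widths.
for n_items, k in [(50, 3), (50, 10), (25, 10)]:
    reps = multilevel_bootstrap(data, n_items, k)
    lo, hi = np.percentile(reps, [2.5, 97.5])
    print(f"N={n_items:3d} K={k:3d}  95% CI width = {hi - lo:.3f}")
```

Under this kind of sketch, narrower intervals at a given annotation budget (N x K) indicate which combination of items and responses per item yields more repeatable conclusions; the actual significance analysis in the paper may use a different metric and test.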