Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
2026-03-11 • Computation and Language
AI summary
The authors show that using large language models (LLMs) as judges often gives a false impression of agreement because models tend to rely on simple surface features rather than truly understanding quality. They found that even when models seem to agree a lot overall, their detailed agreement on individual cases is weaker, especially for high-quality outputs. To improve evaluation, the authors propose generating rubrics that include expert knowledge, which leads to better agreement in clear, knowledge-based fields but more varied opinions in subjective areas. Their work suggests that adding domain-specific knowledge to evaluation criteria is better than using generic rules for judging outputs.
Large Language Models (LLMs) · Evaluation Illusion · Inter-evaluator Agreement · Rubric Generation · Metacognitive Enhanced Rubric Generation (MERG) · Reward Modeling · Reinforcement Learning from AI Feedback (RLAIF) · Spearman Correlation · Intraclass Correlation Coefficient (ICC) · Evaluative Pluralism
Authors
Mingyang Song, Mao Zheng, Chenning Xu
Abstract
The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. \textbf{First}, we demonstrate that this consensus is frequently illusory. We identify and formalize \textbf{Evaluation Illusion}, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs $\times$ 3 frontier judges $\times$ 100 tasks $\times$ 11 temperatures), we show that model-level agreement (Spearman $\rho = 0.99$) masks fragile sample-level agreement (Pearson $\bar{r} = 0.72$; absolute agreement ICC $= 0.67$), that merely sharing rubric structure restores 62\% of total agreement, and that high-quality outputs paradoxically receive the \textit{least} consistent evaluations. \textbf{Second}, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement \textit{increases} in codified domains (Education +22\%, Academic +27\%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.
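The gap between model-level and sample-level agreement can be illustrated with a short sketch. The Python snippet below is not the authors' code: the judge scores are synthetic placeholders and the array shapes are assumptions. It only shows how Spearman $\rho$ over per-model mean scores, Pearson $r$ over individual (model, task) instances, and an absolute-agreement ICC(2,1) are computed, and why averaging across tasks can make two judges look far more aligned than they are case by case.

# Illustrative sketch, not the paper's pipeline: two hypothetical judges score
# 32 models on 100 tasks; agreement is measured at the model level and the sample level.
import numpy as np
from scipy.stats import spearmanr, pearsonr

rng = np.random.default_rng(0)
n_models, n_tasks = 32, 100

# Synthetic scores: a latent per-model quality plus independent judge noise.
quality = rng.normal(5.0, 1.5, size=(n_models, 1))
judge_a = quality + rng.normal(0.0, 1.0, size=(n_models, n_tasks))
judge_b = quality + rng.normal(0.0, 1.0, size=(n_models, n_tasks))

# Model-level agreement: Spearman correlation of per-model mean scores
# (averaging over tasks washes out per-sample disagreement).
rho, _ = spearmanr(judge_a.mean(axis=1), judge_b.mean(axis=1))

# Sample-level agreement: Pearson correlation over individual instances.
r, _ = pearsonr(judge_a.ravel(), judge_b.ravel())

def icc_a1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` has shape (n_targets, n_raters)."""
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-target means
    col_means = ratings.mean(axis=0)   # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between-target mean square
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between-rater mean square
    sse = ((ratings - row_means[:, None] - col_means[None, :] + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))                         # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Absolute-agreement ICC over all instances, treating the two judges as raters.
icc = icc_a1(np.column_stack([judge_a.ravel(), judge_b.ravel()]))

print(f"model-level Spearman rho = {rho:.2f}")   # near 1.0: averaging hides noise
print(f"sample-level Pearson r   = {r:.2f}")     # noticeably lower per instance
print(f"absolute-agreement ICC   = {icc:.2f}")

With these assumptions the per-model correlation comes out close to 1 while the per-instance correlation sits well below it, mirroring the qualitative pattern reported in the abstract (model-level $\rho = 0.99$ versus sample-level $\bar{r} = 0.72$); the exact values depend on the synthetic noise and are not the paper's results.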