*-PLUIE: Personalisable metric with LLM Used for Improved Evaluation

2026-02-17

Computation and Language
AI summary

The authors looked at ways to judge the quality of text made by computers without using slow and costly methods. They improved a tool called ParaPLUIE, which guesses how confident it is in simple yes/no answers by measuring how surprising the text is to a language model. Their new versions, called *-PLUIE, are tailored to specific tasks and match human opinions better while still being fast and efficient. This means their method can quickly and accurately evaluate generated text without extra steps.

Large Language Models, Text Generation, Perplexity, LLM-as-a-judge, Prompting, Human Judgment Alignment, Computational Efficiency, Automatic Text Evaluation, Confidence Estimation
Authors
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive
Abstract
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
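
To illustrate the general idea behind a generation-free "Yes/No" judge, the sketch below compares the next-token probabilities of "Yes" and "No" under a causal language model, so no text is generated and no answer parsing is needed. This is only a minimal illustration of the underlying mechanism, not the authors' implementation: the model name, prompt wording, and the `yes_no_confidence` helper are hypothetical, and the actual prompt templates and scoring used by ParaPLUIE and *-PLUIE are defined in the paper.

```python
# Minimal sketch (not the paper's implementation): score a "Yes"/"No" judgement
# by comparing next-token log-probabilities under a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder backbone; the paper's LLM may differ

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def yes_no_confidence(prompt: str) -> float:
    """Return P(Yes) / (P(Yes) + P(No)) for the token following `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (1, seq_len, vocab_size)
    next_token_logprobs = torch.log_softmax(logits[0, -1], dim=-1)

    # Token ids of the candidate answers (leading space matters for BPE vocabularies).
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]

    p_yes = next_token_logprobs[yes_id].exp()
    p_no = next_token_logprobs[no_id].exp()
    return (p_yes / (p_yes + p_no)).item()

# Hypothetical task-specific prompt in the spirit of *-PLUIE (wording is illustrative).
prompt = (
    "Sentence A: The cat sat on the mat.\n"
    "Sentence B: A cat was sitting on the mat.\n"
    "Are these two sentences paraphrases? Answer:"
)
print(f"confidence(Yes) = {yes_no_confidence(prompt):.3f}")
```

Because only a single forward pass over the prompt is required, this kind of scoring avoids autoregressive decoding and output post-processing, which is where the computational savings relative to standard LLM-judge pipelines come from.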