FUSE: Ensembling Verifiers with Zero Labeled Data

2026-04-20

Computation and Language · Machine Learning
AI summary

The authors present Fully Unsupervised Score Ensembling (FUSE), a method that improves how reliably large language model outputs can be verified without any ground-truth answer labels. FUSE combines multiple imperfect verifiers while controlling how their errors depend on one another, which boosts the performance of the resulting ensemble. Experiments show that FUSE matches or exceeds methods that require labeled data across a variety of language tasks and benchmarks, making model-output verification more reliable without expensive human annotation.

large language models, verification, ensemble methods, unsupervised learning, reward models, spectral algorithms, ground truth, benchmark datasets, semi-supervised learning, model evaluation
Authors
Joonhyuk Lee, Virginia Ma, Sarah Zhao, Yash Nair, Asher Spector, Regev Cohen, Emmanuel J. Candès
Abstract
Verification of model outputs is rapidly emerging as a key primitive for both training and real-world deployment of large language models (LLMs). In practice, this often involves using imperfect LLM judges and reward models since ground truth acquisition can be time-consuming and expensive. We introduce Fully Unsupervised Score Ensembling (FUSE), a method for improving verification quality by ensembling verifiers without access to ground truth correctness labels. The key idea behind FUSE is to control conditional dependencies between verifiers in a manner that improves the unsupervised performance of a class of spectral algorithms from the ensembling literature. Despite requiring zero ground truth labels, FUSE typically matches or improves upon semi-supervised alternatives in test-time scaling experiments with diverse sets of generator models, verifiers, and benchmarks. In particular, we validate our method on both conventional academic benchmarks such as GPQA Diamond and on frontier, unsaturated benchmarks such as Humanity's Last Exam and IMO Shortlist questions.
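To make the "spectral algorithms from the ensembling literature" concrete, here is a minimal sketch of the classical unsupervised spectral meta-learner idea (in the style of Parisi et al., 2014), not the authors' FUSE implementation: if verifier verdicts are conditionally independent given correctness, the off-diagonal covariance of their ±1 verdicts is approximately rank one, and its leading eigenvector recovers each verifier's reliability without any labels. All function names and the simulated data are illustrative assumptions.

```python
import numpy as np

def spectral_ensemble_weights(scores):
    """Estimate per-verifier weights with zero labels.

    scores: (n_samples, n_verifiers) array of +/-1 verdicts.
    Under conditional independence, cov(s_i, s_j) = mu_i * mu_j
    for i != j, where mu_i is verifier i's balanced accuracy
    mapped to [-1, 1]; the leading eigenvector of the rank-one
    completed covariance is therefore proportional to mu.
    """
    Q = np.cov(scores, rowvar=False)
    R = Q.copy()
    np.fill_diagonal(R, 0.0)  # diagonal is corrupted by variance terms
    # Impute the diagonal by iterating toward a rank-one completion.
    for _ in range(20):
        vals, vecs = np.linalg.eigh(R)
        v = vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))
        np.fill_diagonal(R, v ** 2)
    # Eigenvector sign is arbitrary; assume most verifiers
    # beat chance, so the weight vector should sum positive.
    if v.sum() < 0:
        v = -v
    return v

def ensemble_verdicts(scores):
    """Weighted majority vote using the label-free weights."""
    return np.sign(scores @ spectral_ensemble_weights(scores))
```

A quick sanity check on simulated verifiers with accuracies 0.9, 0.85, and 0.7 shows the recovered weights preserve the accuracy ordering and the weighted vote outperforms the weakest verifier, which is the behavior the rank-one argument predicts.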