Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

2026-05-31 · Machine Learning

Machine Learning, Computation and Language
AI summary

The authors study how to teach a strong model using lessons from a weaker model when good answers are hard to find. They focus on deciding which of the weaker model's answers to trust: each answer gets a score, and only the trustworthy ones are used for learning. This method works well in several areas and sometimes even beats training on perfect answers. They also show that repeating this trust-based teaching, with each trained student becoming the next teacher, keeps improving the models.

weak supervision, strong student, weak teacher, data selection, trust functions, label noise, iterative training, model generalization, supervised learning, machine learning
Authors
Arda Uzunoglu, Alvin Zhang, Daniel Khashabi
Abstract
Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass students trained with ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher. The advantage of trust functions can be attributed to several mechanisms.
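
The filtering step and the iterative chain described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how trust scores are computed, so everything below is an assumption made for illustration: the `Example` layout, the use of the teacher's own confidence as the trust score (`confidence_trust`), the threshold value, and the `label_with` and `train_student` callables are all hypothetical and are not the authors' method.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Hypothetical record layout: (input, weak_label, teacher_confidence).
Example = Tuple[str, str, float]
TrustFn = Callable[[Example], float]


def confidence_trust(example: Example) -> float:
    """Hypothetical trust function: reuse the teacher's confidence as the score."""
    _, _, confidence = example
    return confidence


def filter_by_trust(
    examples: Sequence[Example], trust_fn: TrustFn, threshold: float
) -> List[Example]:
    """Keep only weak labels whose scalar trust score reaches the threshold."""
    return [ex for ex in examples if trust_fn(ex) >= threshold]


def weak_to_strong_chain(
    teacher: Any,
    inputs: Sequence[str],
    label_with: Callable[[Any, str], Example],      # hypothetical: teacher labels one input
    train_student: Callable[[List[Example]], Any],  # hypothetical: fit a student on examples
    trust_fn: TrustFn,
    threshold: float,
    rounds: int,
) -> Any:
    """Label, filter by trust, train a student, then reuse it as the next teacher."""
    for _ in range(rounds):
        weak = [label_with(teacher, x) for x in inputs]
        trusted = filter_by_trust(weak, trust_fn, threshold)
        teacher = train_student(trusted)
    return teacher


# Toy weak labels; the third label is wrong and carries low confidence.
weak_labeled: List[Example] = [
    ("2 + 2", "4", 0.97),
    ("capital of France", "Paris", 0.93),
    ("17 * 23", "381", 0.41),
    ("best opening move", "e4", 0.66),
]
trusted = filter_by_trust(weak_labeled, confidence_trust, threshold=0.6)
```

In this sketch the trust function is passed in as a plain callable, so a learned scorer could replace `confidence_trust` without changing the filtering or chaining code.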