Learning to Think from Multiple Thinkers
2026-04-27 • Machine Learning
Machine Learning • Artificial Intelligence • Computational Complexity
AI summary
The authors study how learning from step-by-step explanations (called Chain-of-Thought, or CoT) works when the explanations come from multiple sources ("thinkers") who all provide correct but systematically different solutions. They show that when only passively collected data from a few thinkers is available, learning can be computationally hard under standard cryptographic assumptions. However, they also design an active learning method that learns efficiently by requesting a small amount of CoT data from each of several thinkers and combining it with enough final-answer (end-result) data. The amount of CoT data needed per thinker does not grow as the target accuracy improves, which makes learning from multiple thinkers practical under these conditions.
Chain-of-Thought supervision, polylogarithmic, active learning, cryptographic assumptions, computational learning theory, end-result supervision, passive data collection, learning complexity
Authors
Nirmit Joshi, Roey Magen, Nathan Srebro, Nikolaos Tsilivis, Gal Vardi
Abstract
We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot \mathrm{poly}\log\frac{1}{\varepsilon}$.
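As a rough illustration of the resource bounds stated in the abstract, the sketch below tabulates how each quantity scales with the target accuracy $\varepsilon$. The constants and the degree of the polylog factor are placeholders (set to 1 here), not values from the paper; only the asymptotic dependence on $\varepsilon$ is taken from the abstract.

```python
import math

def resources(eps: float, c_cot: float = 1.0, c_thinkers: float = 1.0,
              c_end: float = 1.0, polylog_deg: int = 1) -> dict:
    """Toy evaluation of the scalings claimed in the abstract.

    All constants (c_*) and polylog_deg are illustrative placeholders,
    not quantities from the paper.
    """
    log_term = math.log(1.0 / eps)
    return {
        # CoT examples per thinker: independent of the target accuracy eps.
        "cot_per_thinker": c_cot,
        # Number of thinkers: log(1/eps) * log log(1/eps).
        "num_thinkers": c_thinkers * log_term * math.log(max(log_term, math.e)),
        # Passive end-result examples: (1/eps) * polylog(1/eps).
        "end_result_examples": c_end * (1.0 / eps) * log_term ** polylog_deg,
    }

for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}: {resources(eps)}")
```

The point of the tabulation is the qualitative tradeoff: as $\varepsilon$ shrinks, only the cheap end-result data grows substantially, while the per-thinker CoT requirement stays flat and the number of thinkers grows only (poly)logarithmically.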