Learning to Think from Multiple Thinkers
2026-04-27 • Machine Learning
Machine Learning • Artificial Intelligence • Computational Complexity
AI summary
The authors study how learning from step-by-step explanations (called Chain-of-Thought, or CoT) works when the explanations come from multiple sources ("thinkers") who all provide correct but systematically different solutions. They show that when only passively collected data from a few thinkers is available, learning can be computationally hard under standard cryptographic assumptions. However, they also design an active learning method that learns efficiently by requesting a small amount of CoT data from each of several thinkers and combining it with enough final-answer (end-result) data. The amount of CoT data needed per thinker does not grow as the target accuracy improves, which makes learning from multiple thinkers practical under these conditions.
Chain-of-Thought supervision, polylogarithmic, active learning, cryptographic assumptions, computational learning theory, end-result supervision, passive data collection, learning complexity
Authors
Nirmit Joshi, Roey Magen, Nathan Srebro, Nikolaos Tsilivis, Gal Vardi
Abstract
We study learning with Chain-of-Thought (CoT) supervision from multiple thinkers, all of whom provide correct but possibly systematically different solutions, e.g., step-by-step solutions to math problems written by different thinkers, or step-by-step execution traces of different programs solving the same problem. We consider classes that are computationally easy to learn using CoT supervision from a single thinker, but hard to learn with only end-result supervision, i.e., without CoT (Joshi et al. 2025). We establish that, under cryptographic assumptions, learning can be hard from CoT supervision provided by two or a few different thinkers, in passive data-collection settings. On the other hand, we provide a generic computationally efficient active learning algorithm that learns with a small amount of CoT data per thinker that is completely independent of the target accuracy $\varepsilon$, a moderate number of thinkers that scales as $\log \frac{1}{\varepsilon}\log \log \frac{1}{\varepsilon}$, and sufficient passive end-result data that scales as $\frac{1}{\varepsilon}\cdot \mathrm{poly}\log\frac{1}{\varepsilon}$.
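As a rough illustration of the resource bounds stated in the abstract, the sketch below tabulates how each quantity scales with the target accuracy $\varepsilon$. The constants and the degree of the polylog factor are placeholders (set to 1 here), not values from the paper; only the asymptotic dependence on $\varepsilon$ is taken from the abstract.

```python
import math

def resources(eps: float, c_cot: float = 1.0, c_thinkers: float = 1.0,
              c_end: float = 1.0, polylog_deg: int = 1) -> dict:
    """Toy evaluation of the scalings claimed in the abstract.

    All constants (c_*) and polylog_deg are illustrative placeholders,
    not quantities from the paper.
    """
    log_term = math.log(1.0 / eps)
    return {
        # CoT examples per thinker: independent of the target accuracy eps.
        "cot_per_thinker": c_cot,
        # Number of thinkers: log(1/eps) * log log(1/eps).
        "num_thinkers": c_thinkers * log_term * math.log(max(log_term, math.e)),
        # Passive end-result examples: (1/eps) * polylog(1/eps).
        "end_result_examples": c_end * (1.0 / eps) * log_term ** polylog_deg,
    }

for eps in (1e-1, 1e-2, 1e-3):
    print(f"eps={eps:g}: {resources(eps)}")
```

The point of the tabulation is the qualitative tradeoff: as $\varepsilon$ shrinks, only the cheap end-result data grows substantially, while the per-thinker CoT requirement stays flat and the number of thinkers grows only (poly)logarithmically.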