$\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
2026-04-15 • Machine Learning • Computation and Language
AI summary
The authors study deep search agents, which locate information to solve complex tasks but are hard to train because rewards are sparse and labeled data is limited. They observe that self-play naturally produces a question construction path (QCP): an intermediate record of how each task was built, which mirrors the solution process in reverse. Their method, π-Play, uses the QCP as privileged guidance for a teacher model, turning sparse outcome rewards into dense feedback without any human input. Experiments show π-Play surpasses fully supervised methods and evolves 2-3× faster than conventional self-play.
Deep search agents • Self-play • Sparse rewards • Credit assignment • Self-distillation • Question construction path (QCP) • Privileged information • Multi-agent learning • Self-evolution • Dense supervision
Authors
Yaocheng Zhang, Yuanheng Zhu, Wenyue Chong, Songjun Tu, Qichao Zhang, Jiajun Chai, Xiaohan Wang, Wei Lin, Guojun Yin, Dongbin Zhao
Abstract
Deep search agents have emerged as a promising paradigm for addressing complex information-seeking tasks, but their training remains challenging due to sparse rewards, weak credit assignment, and limited labeled data. Self-play offers a scalable route to reduce data dependence, but conventional self-play optimizes students only through sparse outcome rewards, leading to low learning efficiency. In this work, we observe that self-play naturally produces a question construction path (QCP) during task generation, an intermediate artifact that captures the reverse solution process. This reveals a new source of privileged information for self-distillation: self-play can itself provide high-quality privileged context for the teacher model in a low-cost and scalable manner, without relying on human feedback or curated privileged information. Leveraging this insight, we propose Privileged Information Self-Play ($\pi$-Play), a multi-agent self-evolution framework. In $\pi$-Play, an examiner generates tasks together with their QCPs, and a teacher model leverages QCP as privileged context to densely supervise a student via self-distillation. This design transforms conventional sparse-reward self-play into a dense-feedback self-evolution loop. Extensive experiments show that data-free $\pi$-Play surpasses fully supervised search agents and improves evolutionary efficiency by 2-3$\times$ over conventional self-play.
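The abstract's loop (examiner emits a task plus its QCP; a privileged teacher uses the QCP to produce a dense target; the student is scored step by step rather than only on the final outcome) can be illustrated with a toy sketch. Everything here is hypothetical: the function names (`examiner`, `teacher`, `per_step_credit`) and the string-based "traces" are illustrative stand-ins, not the authors' actual implementation.

```python
# Toy sketch of the pi-Play idea: the task-construction steps (QCP)
# double as privileged context, so the teacher can supervise every
# intermediate step instead of only the final answer.
import random

random.seed(0)

def examiner():
    """Build a task; the construction steps form the QCP as a byproduct."""
    answer = random.choice(["Paris", "Berlin", "Madrid"])
    qcp = [
        f"pick target fact: {answer}",
        f"note a property: {answer} is a capital city",
        "compose question: which capital city was picked?",
    ]
    question = "Which capital city was picked?"
    return question, answer, qcp

def teacher(question, qcp):
    """Privileged teacher: reversing the QCP yields a step-by-step
    solution trace to distill from (dense supervision)."""
    return list(reversed(qcp))

def per_step_credit(student_trace, teacher_trace):
    """Dense self-distillation signal: fraction of steps matching the
    teacher, rather than a single sparse outcome reward."""
    matches = sum(s == t for s, t in zip(student_trace, teacher_trace))
    return matches / len(teacher_trace)

question, answer, qcp = examiner()
target = teacher(question, qcp)

# A student that gets only the first step right still receives partial
# credit under dense supervision, but zero under the sparse outcome reward.
student_trace = [target[0], "wrong step", "wrong step"]
dense_reward = per_step_credit(student_trace, target)
sparse_reward = 1.0 if student_trace[-1] == target[-1] else 0.0
print(dense_reward, sparse_reward)
```

The contrast between `dense_reward` and `sparse_reward` is the point of the design: even a mostly wrong attempt yields a gradient-worthy signal, which is why the paper reports higher evolutionary efficiency than outcome-reward-only self-play.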