Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization
2026-06-10 • Computation and Language
AI summary
The authors propose RACES, a method that builds complex, verifiable learning environments by automatically combining simpler ones. This approach lets reinforcement learning models practice more diverse and challenging reasoning tasks without needing to manually create each new environment. Their experiments show that training with these combined environments helps models think better in new situations and makes efficient use of resources by needing fewer base environments. Overall, the authors demonstrate improved reasoning skills in large language models using this recursive environment composition.
Reinforcement Learning, Large Language Models, Verifiable Environments, Recursive Composition, Environment Scaling, Reasoning Generalization, Composite Environments, Performance Benchmarking
Authors
Hao Xiang, Qiaoyu Tang, Le Yu, Yaojie Lu, Xianpei Han, Ben He, Le Sun, Bowen Yu, Peng Wang, Hongyu Lin, Dayiheng Liu
Abstract
Reinforcement Learning (RL) with verifiable environments has emerged as a powerful approach for enhancing the reasoning capabilities of Large Language Models (LLMs). While prior research demonstrates that scaling environment quantity improves RL performance, existing methods that construct environments manually or one at a time scale only linearly, which hinders scalable reasoning generalization. This paper introduces RACES (Recursive Automated Composition for Environment Scaling), a framework that treats verifiable environments as composable building blocks that can be recursively assembled. The key insight is that when the codomain (output type) of one environment matches the domain (input type) of another, the two can be automatically fused into a new verifiable environment, which can itself be composed again. RACES is implemented with 300 individual environments and defines a set of composition operators (SEQUENTIAL, PARALLEL, SORT, and SELECT) that induce diverse reasoning patterns. Extensive experiments show that RL training on these composite environments consistently enhances reasoning generalization. Specifically, RACES improves DeepSeek-R1-Distill-Qwen-14B by an average of 3.1 points (from 48.2 to 51.3) and raises Qwen3-14B from 58.8 to 61.1 (a 2.3-point gain) on six benchmarks that were unseen during the construction of the training environments. Moreover, using only 50 base environments, RACES achieves performance comparable to training on 300 individual environments, demonstrating significant efficiency in environment utilization.
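The type-matching idea in the abstract can be illustrated with a minimal Python sketch. Everything here is a hypothetical illustration, not the RACES implementation: the `Env` class, the `sequential` function, and the three toy environments are assumed names, and only a sequential-style fusion is shown because the abstract does not define the operators in detail.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment: a typed task with a reference solver."""
    name: str
    domain: type                   # input type
    codomain: type                 # output type
    solve: Callable[[Any], Any]    # reference solution used for verification

    def verify(self, x: Any, answer: Any) -> bool:
        # A response counts as correct iff it equals the reference solution.
        return self.solve(x) == answer


def sequential(first: Env, second: Env) -> Env:
    """Fuse two environments when first's codomain matches second's domain."""
    if first.codomain is not second.domain:
        raise TypeError(f"cannot compose {first.name} -> {second.name}")
    return Env(
        name=f"SEQUENTIAL({first.name}, {second.name})",
        domain=first.domain,
        codomain=second.codomain,
        # The composite stays verifiable: its reference answer is the
        # second solver applied to the first solver's output.
        solve=lambda x: second.solve(first.solve(x)),
    )


# Toy base environments (assumed for illustration only).
sort_env = Env("sort", list, list, sorted)
sum_env = Env("sum", list, int, sum)
parity_env = Env("is_even", int, bool, lambda n: n % 2 == 0)

# Recursive composition: the result of one fusion is itself an Env,
# so it can be fused again. Overall type: list -> bool.
composite = sequential(sequential(sort_env, sum_env), parity_env)
```

Because `sequential` returns another `Env`, composites can be nested to any depth, which is the sense in which a small set of base environments can yield many composite ones. Calling `sequential(parity_env, sum_env)` raises `TypeError`, since `bool` does not match `list`.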