LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
2026-05-08 • Computation and Language
AI summary
The authors present AutoTTS, a new method that automatically finds better ways to spend extra computation time when a language model is answering questions. Instead of manually creating rules for this process, they build an environment where the system can learn these rules by itself efficiently. They tested AutoTTS on math problems and found that it makes smarter use of computation, improving accuracy without much extra cost. Their discovered methods also work well on new tasks and different sized models, and the whole process is relatively cheap and fast.
Test-time scaling • Large language models • Inference optimization • Controller synthesis • Reasoning trajectories • Beta parameterization • Mathematical reasoning benchmarks • Computation allocation • Automation in machine learning • Open-source software
Authors
Tong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang, Rui Liu, Runpeng Dai, Ruibo Chen, Chenxi Liu, Tianyi Xiong, Xidong Wu, Hongming Zhang, Heng Huang
Abstract
Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width--depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop, and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable, and fine-grained execution-trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy--cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery process costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.
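To make the controller-synthesis framing concrete, the sketch below shows one possible shape of a width--depth TTS controller replayed against pre-collected trajectories with cached probe scores, so that no live LLM calls are needed during evaluation. All names, thresholds, and the toy decision rules here are illustrative assumptions, not the paper's actual discovered strategy or its beta parameterization.

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One step of a pre-collected reasoning trajectory."""
    text: str
    probe_score: float  # cached confidence signal for this prefix

def controller(score: float, depth: int, width: int,
               max_depth: int = 8, max_width: int = 4) -> str:
    """Map a cached probe signal to one of the five control actions.

    The thresholds below are arbitrary placeholders for whatever a
    discovered TTS program would learn.
    """
    if depth >= max_depth:
        return "stop"
    if score < 0.2:
        return "prune"       # abandon a weak branch
    if score > 0.9:
        return "stop"        # confident enough to answer
    if score > 0.6 and width < max_width:
        return "branch"      # spend width: fork an alternative
    if depth % 2 == 0:
        return "probe"       # request another (cached) signal
    return "continue"        # spend depth: extend this trajectory

def replay(trajectory: list[Step], max_depth: int = 8) -> list[str]:
    """Evaluate the controller against one pre-collected trajectory.

    Because probe scores are cached in the trajectory, this loop is
    pure Python bookkeeping: no model is queried.
    """
    width = 1
    actions = []
    for depth, step in enumerate(trajectory):
        action = controller(step.probe_score, depth, width, max_depth)
        actions.append(action)
        if action == "branch":
            width += 1
        if action in ("stop", "prune"):
            break
    return actions
```

Replaying a short toy trajectory, `replay([Step("a", 0.5), Step("b", 0.7), Step("c", 0.95)])` probes the middling first step, branches on the stronger second step, and stops at the confident third one. The point of this structure is the one the abstract emphasizes: controllers are cheap functions over cached signals, so a search procedure can score many candidate TTS programs without repeated inference.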