Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)

2026-06-03 • Machine Learning

Machine Learning, Artificial Intelligence, Computation and Language

AI summary

The authors found that when language models fail on reasoning tasks, simply retrying more times does not always help: some failures are random, while others are structural and persist however many attempts are made. They show that the distributional pattern of failed attempts (not the text itself) reveals whether a failure can be fixed by retrying or needs a different intervention. Using three trajectory features, they group failures into types and choose a fix without retraining the model. A training-free routing rule built on these features lifts rescue by +12.2% on the Steerable-Hard subset, and both the features and the rule transfer across two cross-family probes.

language models, reasoning problems, inference-time, failed rollouts, trajectory features, test-time interventions, post-training methods, model routing, Steerable-Hard subset, training-free analysis

Authors

Nizar Islah, Istabrak Abbes, Irina Rish, Sarath Chandar, Eilif B. Muller

Abstract

When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional attempts, and the failed traces play no further role. We argue this discards a crucial signal; some failures come from unlucky sampling, where more rollouts help, while others are structural and resist resampling regardless of budget. We propose that failed traces encode recoverability structure: the inference-time signature of which test-time interventions can rescue a given failure. Three problem-level trajectory features, derived from the structure of available interventions, recover this structure from the distributional signature of failed rollouts, not their text. They cluster failures into stable regimes, characterize the failure topography of different post-training methods ($84.3{\pm}4.3\%$ accuracy, $+20\%$ over a majority-class baseline), and support a training-free routing rule that lifts rescue by $+12.2\%$ on the deployment-relevant Steerable-Hard subset (failures where retry is insufficient and a bounded intervention is reachable). The features and the routing rule transfer across two cross-family probes. The same three features thus convert failed traces from discarded data into a diagnostic object, supporting test-time routing and post-training analysis without training-time or weight-space access.
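
The contrast between unlucky sampling and structural failure can be illustrated with a minimal sketch. This is not the authors' method or their three features: it assumes, as a simplification, that each rollout on a problem succeeds independently with a fixed probability `p`, so the chance that at least one of `k` attempts succeeds is `1 - (1 - p)^k`. The function name `rescue_probability` and the example probabilities are hypothetical.

```python
def rescue_probability(p: float, k: int) -> float:
    """Chance that at least one of k independent rollouts succeeds.

    Assumption (not from the paper): rollouts are independent and share
    one per-problem success probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


if __name__ == "__main__":
    # Illustrative values only: a low but non-zero p stands in for
    # "unlucky sampling", p = 0 stands in for a structural failure.
    for label, p in [("unlucky sampling", 0.2), ("structural", 0.0)]:
        curve = [round(rescue_probability(p, k), 3) for k in (1, 4, 16, 64)]
        print(label, curve)
```

Under this assumption, extra rollouts quickly rescue a problem with a small non-zero success probability, while a problem with zero success probability stays unsolved at any budget, which is why such failures call for a different intervention rather than more retries.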
