Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation
2026-06-03 • Artificial Intelligence
AI summary
The authors observe that current methods for deciding how much computation an AI model spends on a task usually consider only how hard the task is, which treats all mistakes as equally bad. They argue this is unrealistic because some errors cause far more damage than others. They therefore propose a method that estimates from the task description how costly a mistake would be, then spends more compute on tasks where errors are more expensive. In tests on software engineering tasks, their approach reduced cost-weighted loss by 22% to 33% compared with routing by difficulty alone. Their predictor never labeled a high-consequence task as low-consequence on the 300 SWE-bench tasks, and the scheduler driven by that predictor retained over 90% of the gain achieved with oracle consequence labels.
test-time computation, compute allocation, difficulty prediction, cost-sensitive learning, task scheduling, software engineering benchmarks, marginal utility, cost-weighted loss, predictive modeling, AI resource management
Authors
Jingbo Wen, Liang He, Ziqi He
Abstract
Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, to different tasks. Existing methods generally drive this allocation by predicted difficulty and spend more compute where it is expected to raise accuracy. This implicitly assumes that all failures cost the same, since an accuracy objective weights every task equally. However, such an assumption does not hold in deployment: A typo in a log message and a migration that corrupts a production database both count as one benchmark failure, but their real-world costs are fundamentally different. To fill this gap, we propose consequence-aware test-time compute allocation. Instead of routing compute only by predicted difficulty, we use a lightweight predictor to estimate from the issue text how costly a task would be if solved incorrectly. The scheduler then routes higher-consequence tasks to larger compute tiers or higher thinking budgets under the same total budget. We conduct main experiments on SWE-bench Lite and evaluate cross-dataset behavior on Multi-SWE-bench mini, covering 700 software-engineering tasks in total. Our results reveal that consequence and difficulty are approximately orthogonal under various annotations, and that current thinking models do not allocate compute sufficiently according to consequence. Moreover, our issue-only predictor never misclassifies a high-consequence task as low-consequence across the 300 SWE-bench tasks. Under matched compute budgets, our consequence-aware scheduler reduces cost-weighted loss by 22% to 33% relative to difficulty-aware routing; in particular, the priority-aware variant, which routes by per-task cost scaled by the marginal-utility signal, crosses 30%, and its deployable predictor-driven version retains over 90% of the oracle gain.
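To make the contrast between difficulty-aware and consequence-aware routing concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical three-tier setup where each task has a predicted failure cost and a failure probability per tier; the names Task, TIER_PRICE, allocate, and expected_loss, the greedy upgrade rule, and all numbers are illustrative assumptions. The consequence-aware score (cost times marginal failure reduction per unit of compute) is one plausible reading of "per-task cost scaled by the marginal-utility signal".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    cost: float    # assumed predicted cost of a failure (the "consequence")
    p_fail: tuple  # assumed failure probability at each compute tier


# Hypothetical compute price of each tier (e.g. small, medium, large budget).
TIER_PRICE = (1, 3, 6)


def expected_loss(tasks, tiers):
    """Cost-weighted loss: sum over tasks of cost * failure probability."""
    return sum(t.cost * t.p_fail[k] for t, k in zip(tasks, tiers))


def allocate(tasks, budget, use_cost=True):
    """Greedy routing: start every task at tier 0, then repeatedly buy the
    affordable one-step tier upgrade with the best gain per unit of compute.

    use_cost=False scores upgrades by accuracy gain only (difficulty-style
    baseline); use_cost=True weights that gain by the task's failure cost.
    """
    tiers = [0] * len(tasks)
    spent = len(tasks) * TIER_PRICE[0]
    while True:
        best = None  # (score, task index, extra compute)
        for i, t in enumerate(tasks):
            k = tiers[i]
            if k + 1 >= len(TIER_PRICE):
                continue  # already at the top tier
            extra = TIER_PRICE[k + 1] - TIER_PRICE[k]
            if spent + extra > budget:
                continue  # upgrade would exceed the shared budget
            gain = t.p_fail[k] - t.p_fail[k + 1]
            if use_cost:
                gain *= t.cost
            score = gain / extra
            if best is None or score > best[0]:
                best = (score, i, extra)
        if best is None:
            return tiers
        _, i, extra = best
        tiers[i] += 1
        spent += extra


# Illustrative tasks: a hard but harmless typo, an easier but dangerous migration.
tasks = [
    Task("log-message typo", cost=1.0, p_fail=(0.6, 0.3, 0.2)),
    Task("database migration", cost=100.0, p_fail=(0.2, 0.1, 0.05)),
    Task("UI tweak", cost=2.0, p_fail=(0.4, 0.2, 0.1)),
]

for label, use_cost in (("difficulty-only", False), ("consequence-aware", True)):
    tiers = allocate(tasks, budget=8, use_cost=use_cost)
    print(label, tiers, round(expected_loss(tasks, tiers), 2))
```

With these made-up numbers and a budget of 8 units, the difficulty-only rule upgrades the typo and the UI tweak (tiers [1, 0, 1], cost-weighted loss of about 20.7), while the consequence-aware rule spends the budget on the migration (tiers [0, 2, 0], loss of about 6.4). The gap is an artifact of the toy inputs and says nothing about the magnitudes reported in the paper.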