When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
2026-05-01 • Computation and Language
AI summary
The authors studied how well large language models (LLMs) follow step-by-step instructions to do arithmetic calculations. They created a test where models had to carry out multi-step math procedures and checked if the final answers were correct. They found that as the number of steps increased, the models made more mistakes, often skipping steps or adding wrong ones. This shows that just getting the final answer right doesn’t always mean the model truly followed the instructions carefully.
large language models, procedural execution, arithmetic algorithms, step-wise reasoning, benchmarking, accuracy, instruction following, intermediate variables, error analysis, self-correction
Authors
Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh
Abstract
Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.
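To make the task concrete, the sketch below shows one way such a benchmark item could be constructed: a sequence of simple arithmetic steps over two numeric inputs, where later steps may look back to any earlier intermediate variable, and the reference answer comes from executing the steps faithfully. This is an illustrative assumption about the setup, not the authors' released benchmark; the function names, prompt format, and operation set are hypothetical.

```python
# Illustrative sketch (not the authors' code): build a toy step-wise
# arithmetic procedure with look-back dependencies, then execute it to
# obtain the ground-truth final value. Names and formats are assumptions.
import random

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def make_procedure(num_steps: int, seed: int = 0):
    """Return a list of steps; step i defines v{i} from two earlier variables."""
    rng = random.Random(seed)
    steps = []
    # v0 and v1 are the two numeric inputs supplied in the prompt.
    available = ["v0", "v1"]
    for i in range(2, num_steps + 2):
        op = rng.choice(list(OPS))
        # Look-back dependency: operands may reference any earlier variable,
        # not just the most recently defined one.
        left, right = rng.choice(available), rng.choice(available)
        steps.append((f"v{i}", op, left, right))
        available.append(f"v{i}")
    return steps

def evaluate(steps, x: int, y: int) -> int:
    """Execute the procedure faithfully to get the reference answer."""
    env = {"v0": x, "v1": y}
    for target, op, left, right in steps:
        env[target] = OPS[op](env[left], env[right])
    return env[steps[-1][0]]

def render_prompt(steps, x: int, y: int) -> str:
    """Format the procedure as step-wise instructions for the model."""
    lines = [f"Let v0 = {x} and v1 = {y}. Follow every step in order."]
    for target, op, left, right in steps:
        lines.append(f"Compute {target} = {left} {op} {right}.")
    lines.append(f"Report only the final value of {steps[-1][0]}.")
    return "\n".join(lines)

if __name__ == "__main__":
    steps = make_procedure(num_steps=5, seed=42)
    print(render_prompt(steps, x=7, y=3))
    print("Expected answer:", evaluate(steps, x=7, y=3))
```

Under this framing, scaling `num_steps` from 5 to 95 lengthens the chain of dependent intermediate values a model must track, which is the complexity axis along which the reported accuracy drop (61% to 20%) is measured; comparing the model's trace against the executed steps is what enables the generation-level error categories such as under-executed traces and hallucinated extra steps.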