Reasoning Structure of Large Language Models
2026-06-02 • Artificial Intelligence
Artificial Intelligence, Machine Learning
AI summary
The authors created a new way to test large reasoning models (LRMs) using logic puzzles and turned the models' reasoning traces into detailed maps of their thought process. This helps reveal how the models think, not just whether they got the right answer or how many tokens they used. They also made a new score that shows how concentrated a model's reasoning is as it works through a problem. Their work helps spot different kinds of mistakes and compare how model reasoning changes as puzzles get harder.
large reasoning models, logic puzzles, reasoning graphs, final-answer accuracy, token count, benchmark, reasoning efficiency, failure diagnosis, puzzle difficulty
Authors
Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, Roger Wattenhofer
Abstract
Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.
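To make the idea of a reasoning graph concrete, the sketch below represents a trace as a mapping from each claim to the claims it depends on, then measures what fraction of all claims the final answer actually relies on. This is only an illustration under stated assumptions: the abstract does not define the reasoning efficiency metric, so `used_fraction`, `ancestors`, and the toy graph are hypothetical stand-ins, not the authors' pipeline or formula.

```python
from collections import deque


def ancestors(deps, target):
    """Return `target` plus every claim it transitively depends on."""
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in deps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def used_fraction(deps, final_claim):
    """Hypothetical proxy (not the paper's metric): share of claims
    that the final answer depends on, directly or indirectly."""
    all_claims = set(deps)
    for parents in deps.values():
        all_claims.update(parents)
    all_claims.add(final_claim)
    return len(ancestors(deps, final_claim)) / len(all_claims)


# Hypothetical toy trace: each claim maps to the claims it depends on.
toy_graph = {
    "c1": [],            # given puzzle fact
    "c2": [],            # given puzzle fact
    "c3": ["c1", "c2"],  # deduction used by the answer
    "c4": ["c2"],        # dead-end exploration, never used
    "answer": ["c3"],
}

score = used_fraction(toy_graph, "answer")
print(f"fraction of claims supporting the answer: {score:.2f}")
```

Two traces with the same token count and the same correct answer can differ on a structural measure like this one, since a trace with many unused branches scores lower. That is the kind of distinction the abstract says accuracy and token count conflate.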