Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis
2026-04-15 • Computation and Language
AI summary
The authors identify two main types of mistakes in large language models' (LLMs) reasoning traces: internal errors within individual steps, and step-wise flaws in reasoning depth (overthinking or underthinking). They find that simply providing ground-truth answer labels does not improve reasoning. To address this, they introduce CRAFT, which aggregates the consensus parts of multiple candidate reasoning traces into a graph and generates a higher-quality overall trace from it. Their method improves accuracy by over 10% on average and produces higher-quality reasoning on both logical and mathematical tasks compared to other approaches.
Large Language Models, Reasoning Traces, Step Internal Flaws, Step-wise Flaws, Reasoning Knowledge Graph, Topological Generation, Logical Reasoning, Mathematical Reasoning, Consensus Methods, Benchmark Evaluation
Authors
Zipeng Ling, Shuliang Liu, Shenghong Fu, Yuehao Tang, Seonil Son, Yao Wan, Xuming Hu
Abstract
LLM reasoning traces suffer from complex flaws -- *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking) -- which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs' reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of step flaws: it builds a Reasoning Knowledge Graph (RKG) from the consensus parts of multiple candidate traces and synthesizes a high-quality trace through topological generation. Our approach improves label-prediction accuracy by over 10% on average and consistently outperforms all baselines across both logical and mathematical reasoning benchmarks. Further, detailed benchmark evaluation shows that our method also improves the quality of LLMs' reasoning traces along multiple dimensions.
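The abstract's pipeline -- keep the steps that multiple candidate traces agree on, connect them into a graph, and emit a synthesized trace in topological order -- can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `synthesize_consensus_trace`, the treatment of steps as exact strings, and the majority-support threshold `min_support` are all assumptions for the sake of a runnable example.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # Python 3.9+

def synthesize_consensus_trace(traces, min_support=2):
    """Toy sketch of consensus-graph trace synthesis.

    traces: list of candidate reasoning traces, each a list of step strings.
    Steps kept only if they appear in >= min_support traces; ordering edges
    follow the relative order of kept steps within each trace.
    """
    # 1) Count how many candidate traces contain each step.
    support = defaultdict(int)
    for trace in traces:
        for step in set(trace):
            support[step] += 1
    consensus = {s for s, c in support.items() if c >= min_support}

    # 2) Build the consensus graph: map each kept step to the set of
    #    steps that precede it in at least one candidate trace.
    preds = {s: set() for s in consensus}
    for trace in traces:
        kept = [s for s in trace if s in consensus]
        for i, later in enumerate(kept):
            for earlier in kept[:i]:
                preds[later].add(earlier)

    # 3) Topological generation: emit steps respecting all ordering edges.
    #    (If candidate traces disagree cyclically, this raises CycleError;
    #    a real system would need a tie-breaking or edge-weighting rule.)
    return list(TopologicalSorter(preds).static_order())

candidates = [
    ["parse problem", "set up equation", "solve", "check answer"],
    ["parse problem", "set up equation", "digress", "solve"],
    ["parse problem", "solve", "check answer"],
]
print(synthesize_consensus_trace(candidates))
# "digress" appears in only one trace, so it is dropped; the rest
# are ordered consistently with every candidate.
```

The key design point mirrored here is that consensus filtering discards idiosyncratic (potentially flawed) steps, while the topological pass reassembles the survivors into a single coherent order.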