EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
2026-03-12 • Computer Vision and Pattern Recognition • Computation and Language
AI summary
The authors address problems in combining multimodal large language models (MLLMs) with diffusion models for complex tasks, such as reasoning and planning. They found that existing methods do not reason deeply enough and use fixed guidance during decoding, limiting accuracy. To fix this, they propose a new method called Endogenous Chain-of-Thought (EndoCoT), which repeatedly improves the model's internal reasoning steps and connects these to the model's action process. This approach helps the model solve tasks step-by-step more effectively. Their tests show EndoCoT improves performance notably across various difficult puzzles and benchmarks.
Multimodal Large Language Models • Diffusion Models • Chain-of-Thought • Text Encoder • Denoising Process • Latent Thought States • Reasoning • Maze Solving • Traveling Salesman Problem • Sudoku
Authors
Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang
Abstract
Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks, primarily as text encoders, to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth: single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance on complex tasks. (ii) The guidance remains invariant during decoding, which prevents the DiT from progressively decomposing complex instructions into actionable denoising steps, even when the MLLM encodings are correct. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates the MLLM's reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, bridging these states to the DiT's denoising process. Second, a terminal thought grounding module keeps the reasoning trajectory grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku), EndoCoT achieves an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points.
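The control flow the abstract describes — a loop that iteratively refines a latent thought state, hands one refined state to each denoising step, and applies a grounding loss only at the terminal state — can be sketched in miniature. This is a toy illustration, not the authors' implementation: the update rule, the MSE grounding objective, and all function names (`refine`, `terminal_grounding_loss`, `endocot_forward`) are assumptions made for exposition, with plain Python lists standing in for latent tensors.

```python
def refine(state):
    # One iterative-thought-guidance step. Toy update: pull each coordinate
    # toward the state's running mean, mimicking endogenous self-refinement
    # of the latent thought state (no ground truth is consulted here).
    mean = sum(state) / len(state)
    return [0.8 * s + 0.2 * mean for s in state]

def terminal_grounding_loss(final_state, answer):
    # Terminal thought grounding: align only the FINAL latent state with a
    # ground-truth answer embedding (MSE as a stand-in objective).
    return sum((s - a) ** 2 for s, a in zip(final_state, answer)) / len(answer)

def endocot_forward(init_state, num_steps=4):
    # Emit one refined guidance state per denoising step, so the DiT would
    # receive progressively sharpened conditioning rather than a single
    # fixed encoding reused at every step.
    states, state = [], init_state
    for _ in range(num_steps):
        state = refine(state)
        states.append(state)
    return states

states = endocot_forward([1.0, 3.0])
print(terminal_grounding_loss(states[-1], [2.0, 2.0]))
```

In this toy setup the refinement steps contract the state toward its mean, so the terminal grounding loss against the target `[2.0, 2.0]` shrinks as more refinement steps are taken, illustrating how later denoising steps would receive guidance closer to the grounded answer.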