CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts

2026-06-30 • Computer Vision and Pattern Recognition


AI summary

The authors propose CoLT, a method that lets multi-modal models solve visual reasoning tasks by thinking in short chains of hidden "thought" steps instead of long, slow text explanations. During training, a lightweight external decoder supervises each hidden reasoning step so that it stays meaningful; the decoder is removed at inference to keep things fast. Experiments on eight benchmarks show that CoLT cuts inference time by 10.1× compared to text-based reasoning and outperforms other latent reasoning methods, including approaches that rely on auxiliary images with costly annotations. The approach makes reasoning more efficient by working with compact internal representations.

Chain-of-thought reasoning, Multi-modal large language models, Latent representations, Visual reasoning, Inference efficiency, Decoder supervision, Step-level supervision, Text decoding, Model training stability, CoLT framework

Authors

Lianyu Hu, Shengqian Qin, Zeqin Liao, Qing Guo, Liang Wan, Wei Feng, Yang Liu

Abstract

Chain-of-thought (CoT) reasoning has enabled multi-modal large language models (MLLMs) to tackle complex visual reasoning tasks by generating explicit intermediate reasoning steps in natural language. However, this text-based reasoning paradigm is inherently slow at inference time, often requiring thousands of tokens, and is fundamentally constrained by the expressiveness of natural language. In this paper, we propose CoLT (Chain of Latent Thoughts), a novel framework that teaches multi-modal models to reason through a chain of latent thought representations instead of verbose text tokens, and that can complete its reasoning in as few as 3 steps. Naively forcing the model to think with latent states tends to produce semantically meaningless representations and makes training unstable. To effectively regulate the latent reasoning process, we introduce a lightweight external decoder that provides step-level supervision for each latent reasoning step in two complementary directions: a forward mode that decodes latent thoughts into the textual reasoning of the next step, and a backward mode that aligns decoder hidden states with the model's latent thoughts given the preceding textual context. We further incorporate internal supervision that encourages coherent step-by-step latent transitions. The decoder and internal supervision are removed during inference to preserve the efficiency of latent reasoning. Extensive experiments on eight benchmarks demonstrate that CoLT not only outperforms existing latent reasoning methods such as CODI and SIM-CoT, but also surpasses latent visual reasoning approaches that rely on auxiliary images with costly annotation requirements. Compared to text CoT methods, CoLT reduces inference time by 10.1$\times$ and text decoding time by 22.6$\times$. Code is released at https://github.com/hulianyuyy/CoLT.
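The latent reasoning loop described in the abstract can be illustrated with a toy sketch. This is not the released CoLT code: `make_toy_step` and `latent_chain` are hypothetical names, and a small random tanh layer stands in for a forward pass of the model. It only shows the control flow, assuming each latent thought is fed back as the next input and no text tokens are decoded between steps.

```python
import math
import random


def make_toy_step(dim, seed=0):
    """Build a hypothetical stand-in for one forward pass of the model.

    Assumption: a fixed random tanh layer replaces the real MLLM backbone.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def step(hidden):
        # Map the current latent state to the next latent state.
        return [math.tanh(sum(w * h for w, h in zip(row, hidden))) for row in weights]

    return step


def latent_chain(step, initial_hidden, num_steps=3):
    """Run num_steps latent reasoning steps.

    Each latent thought is fed back directly as the next input;
    nothing is decoded into text tokens between steps.
    """
    thoughts = []
    hidden = initial_hidden
    for _ in range(num_steps):
        hidden = step(hidden)
        thoughts.append(hidden)
    return thoughts


step = make_toy_step(dim=4)
thoughts = latent_chain(step, initial_hidden=[0.1, -0.2, 0.3, 0.0], num_steps=3)
print(len(thoughts))  # number of latent thoughts produced
```

In this sketch, the external decoder and the internal supervision mentioned in the abstract would act on the entries of `thoughts` during training only, which is why they do not appear in the inference loop.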
