Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement
2026-04-07 • Machine Learning
Machine Learning · Artificial Intelligence · Computation and Language
AI summary
The authors studied how large language models come to represent the world internally. They found that predicting multiple tokens at once (multi-token prediction, MTP) helps models form more consistent internal states than predicting one token at a time. However, MTP can cause the model to take shortcuts that break real-world rules. To fix this, the authors developed a method called LSE-MTP that ties predictions more closely to the true hidden states, improving how well the model's internal understanding matches reality and making it more robust.
Large Language Models · Next-Token Prediction · Multi-Token Prediction · Internal World Models · Gradient Inductive Bias · Representation Learning · Structural Hallucinations · Latent Space · State Trajectories · Robustness
Authors
Qimin Zhong, Hao Liao, Haiming Qin, Mingyang Zhou, Rui Mao, Wei Chen, Naipeng Chao
Abstract
Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) relies on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective on the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method, Latent Semantic Enhancement MTP (LSE-MTP), which anchors predictions to ground-truth hidden-state trajectories. Experiments on synthetic graphs and the real-world Manhattan Taxi Ride dataset show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.
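To make the abstract's objective concrete, the following is a minimal sketch of what "anchoring predictions to ground-truth hidden-state trajectories" could look like: a standard multi-token cross-entropy term plus an MSE latent-alignment term. All names, shapes, and the weighting `alpha` are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, K = 8, 4, 3  # vocab size, hidden dim, number of future tokens (assumed)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mtp_loss(h, heads, targets):
    """Standard MTP: average cross-entropy over K future-token heads."""
    losses = []
    for k in range(K):
        probs = softmax(h @ heads[k])              # (D,) @ (D, V) -> (V,)
        losses.append(-np.log(probs[targets[k]] + 1e-12))
    return float(np.mean(losses))

def lse_mtp_loss(h, heads, targets, pred_states, true_states, alpha=0.5):
    """Hypothetical LSE-MTP objective: token loss plus an MSE term that
    anchors predicted latents to ground-truth state trajectories."""
    token_loss = mtp_loss(h, heads, targets)
    latent_loss = float(np.mean((pred_states - true_states) ** 2))
    return token_loss + alpha * latent_loss

h = rng.normal(size=D)                             # current hidden state
heads = [rng.normal(size=(D, V)) for _ in range(K)]
targets = rng.integers(0, V, size=K)               # next K ground-truth tokens
pred_states = rng.normal(size=(K, D))              # model-predicted latents
true_states = rng.normal(size=(K, D))              # environment's true latents

print("MTP loss:    ", mtp_loss(h, heads, targets))
print("LSE-MTP loss:", lse_mtp_loss(h, heads, targets, pred_states, true_states))
```

Since the alignment term is non-negative, the combined loss is never below the pure token loss; the paper's claim is that trading off against this extra term keeps latent trajectories on legal paths rather than token-level shortcuts.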