Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
2026-03-31 • Machine Learning
AI summary
The authors study how well we can monitor a language model's step-by-step reasoning (its Chain-of-Thought, or CoT) by watching its internal process. They propose a way to predict when this monitoring will succeed or fail based on how the model is trained, in particular whether the training objectives for final answers and for the reasoning steps agree or conflict. Their experiments show that when training rewards the model in conflicting ways, its thoughts become harder to monitor. This helps explain when CoT monitoring works well and when it fails.
Chain-of-Thought • Large Language Models • monitorability • reinforcement learning • reward function • training alignment • model interpretability • post-training • conflicting objectives
Authors
Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah
Abstract
Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model (the monitorability of the CoT) can be affected by training, for instance by the model learning to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To validate our framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with "in-conflict" reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.
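The abstract's reward decomposition can be made concrete with a toy sketch. Below is one illustrative way to classify an output-reward term and a CoT-reward term as aligned, orthogonal, or in-conflict: measure their empirical correlation across candidate samples. The reward functions, field names, and the correlation-based criterion are all assumptions for illustration, not the authors' formal definitions.

```python
# Hypothetical reward terms; the field names and the correlation-based
# classification rule are illustrative assumptions, not the paper's method.
def output_reward(sample):
    return sample["answer_correct"]  # term depending on the final output

def cot_reward(sample):
    return sample["cot_brevity"]  # term depending on the chain-of-thought

def classify_terms(samples, eps=0.1):
    """Classify the two reward terms by their empirical correlation
    across candidate (output, CoT) pairs."""
    xs = [output_reward(s) for s in samples]
    ys = [cot_reward(s) for s in samples]
    n = len(samples)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    if vx == 0 or vy == 0:
        return "orthogonal"  # a constant term cannot conflict or align
    corr = cov / (vx ** 0.5 * vy ** 0.5)
    if corr > eps:
        return "aligned"
    if corr < -eps:
        return "in-conflict"
    return "orthogonal"

# Toy data: rewarding brevity trades off perfectly against correctness,
# so the terms come out "in-conflict".
samples = [{"answer_correct": c, "cot_brevity": 1 - c} for c in [0, 1, 0, 1, 1, 0]]
print(classify_terms(samples))  # -> in-conflict
```

Under this toy criterion, a brevity reward that penalizes exactly the reasoning needed for correct answers is classified as in-conflict, matching the framework's prediction that such training pressure degrades monitorability.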