A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners

2026-06-02 · Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors studied how large language models (LLMs), after being fine-tuned on planning examples, represent the planning problems they solve. They found that fine-tuned LLMs learn internal representations that separate valid from invalid actions and capture some state predicates, even when their output probabilities do not clearly reflect this. They also showed that fine-tuning data with broader state space coverage, such as random walks, helps the models recover the underlying rules of the planning task more accurately. Overall, the authors contribute a recipe for applying interpretability techniques to planning LLMs.

Large Language Models, Supervised Fine-Tuning, Classical Planning, World Model, Action Validity, State Predicates, Interpretability, Internal Representations, Generative Models, Random Walk Data
Authors
Patrick Emami, Nan Qiang, Peter Graf
Abstract
Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.
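
To make "linearly encode action validity" concrete, here is a minimal sketch of a linear probe. It is not the authors' code: the activations are synthetic stand-ins for hidden states, and the function names, dimensions, and hyperparameters are illustrative assumptions. The idea is that if a simple logistic-regression classifier trained on hidden states can separate valid from invalid actions on held-out examples, the property is said to be linearly decodable from those representations.

```python
import numpy as np


def make_synthetic_activations(n=400, dim=16, seed=0):
    """Hypothetical stand-in for LLM hidden states.

    Valid (1) and invalid (0) actions are shifted in opposite directions
    along one random unit vector, plus Gaussian noise in every dimension.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    noise = rng.normal(size=(n, dim))
    acts = noise + 3.0 * (2 * labels[:, None] - 1) * direction
    return acts, labels


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Fit logistic regression with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        grad = probs - y  # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(x, y, w, b):
    """Fraction of examples the probe classifies correctly."""
    preds = (x @ w + b > 0).astype(int)
    return float((preds == y).mean())


acts, labels = make_synthetic_activations()
train_x, test_x = acts[:300], acts[300:]
train_y, test_y = labels[:300], labels[300:]

w, b = train_linear_probe(train_x, train_y)
acc = probe_accuracy(test_x, test_y, w, b)
print(f"held-out probe accuracy: {acc:.2f}")
```

In a real experiment the synthetic array would be replaced by activations extracted from a chosen layer of the fine-tuned model, with labels derived from a planning domain's ground-truth action validity. Probe accuracy should also be compared against a control, such as shuffled labels, before it is read as evidence of an encoded world model.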