Grounded World Model for Semantically Generalizable Planning

2026-04-13

Robotics · Artificial Intelligence
AI summary

The authors address a problem in Model Predictive Control (MPC): predicting future images to guide action selection usually requires a goal image in advance, which is hard to obtain in new environments. They propose the Grounded World Model (GWM), which scores candidate actions by how well their predicted outcomes match a task instruction written in natural language, using a vision-language-aligned latent space. This removes the need for an exact goal image and improves generalization to new tasks. On the WISER benchmark, their method succeeds far more often on unseen tasks than traditional vision-language-action models, which overfit the training set. In short, the system grounds language instructions directly in predicted visual outcomes.

Model Predictive Control (MPC) · Visuomotor Control · Latent Space · Pretrained Vision Encoder · Vision-Language Alignment · Grounded World Model (GWM) · Task Instruction Embedding · Semantic Generalization · WISER Benchmark · Referring Expressions
Authors
Quanyi Li, Lan Feng, Haonan Zhang, Wuyang Li, Letian Wang, Alexandre Alahi, Harold Soh
Abstract
In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder like DINO and JEPA. However, it is challenging to obtain the goal image in advance of the task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, reflected by the similarity of embeddings. This approach transforms the visuomotor MPC to a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves a 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.