In-Context World Modeling for Robotic Control
2026-06-24 • Robotics
Robotics • Computer Vision and Pattern Recognition
AI summary
The authors address a problem where robots using Vision-Language-Action models perform poorly when the setup changes, for example when the camera angle or the robot's shape differs from training. They propose a method called In-Context World Modeling (ICWM) that lets a robot work out how its current system behaves by observing the effects of its own actions before starting a task. This way, the robot can adapt to a new setup without extra training. Their tests in simulation and on real robots show that ICWM performs significantly better than standard Vision-Language-Action baselines on novel camera viewpoints.
Vision-Language-Action models, system identification, in-context learning, robot adaptation, world dynamics, self-generated interactions, robot morphology, camera viewpoints, simulation, fine-tuning
Authors
Siyin Wang, Junhao Shi, Senyu Fei, Zhaoyang Fu, Li Ji, Jingjing Gong, Xipeng Qiu
Abstract
Modern Vision-Language-Action (VLA) models often fail to generalize to novel setups, such as altered camera viewpoints or robot morphologies, because they are typically conditioned only on current observations and language instructions. By ignoring the underlying system configuration as a variable, these models implicitly assume a fixed execution context encountered during training, necessitating data-intensive fine-tuning for any new environment. In this work, we introduce In-Context World Modeling (ICWM), a framework that treats system identification as an in-context adaptation problem. ICWM enables robot policies to autonomously infer essential system variables from a short history of self-generated, task-agnostic interactions. Unlike traditional In-Context Learning that uses demonstrations to specify what task to perform, ICWM leverages the context window to understand how the system operates. By processing these interactions before task execution, the model implicitly captures the world dynamics of the current system, enabling adaptation to novel configurations without parameter updates. Extensive experiments in simulation and on real-world robot platforms demonstrate that ICWM significantly outperforms standard VLA baselines on novel camera viewpoints.
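The probe-then-act idea in the abstract can be illustrated with a toy 2D sketch. This is not the authors' implementation. It assumes a hypothetical system in which an unknown camera yaw rotates every observed motion, and it replaces the model's implicit in-context inference with an explicit least-squares angle estimate. All names here (`collect_context`, `infer_view_angle`, `policy`, `step`) are illustrative assumptions.

```python
import math
import random


def rotate(v, theta):
    """Rotate a 2D vector v by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def collect_context(step, num_probes=8, seed=0):
    """Run task-agnostic random probe actions; record (action, observed motion)."""
    rng = random.Random(seed)
    context = []
    for _ in range(num_probes):
        action = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        context.append((action, step(action)))
    return context


def infer_view_angle(context):
    """Explicit stand-in for implicit in-context inference (an assumption):
    least-squares estimate of the yaw that maps actions to observed motions."""
    cross = sum(a[0] * d[1] - a[1] * d[0] for a, d in context)
    dot = sum(a[0] * d[0] + a[1] * d[1] for a, d in context)
    return math.atan2(cross, dot)


def policy(desired_motion, context=None):
    """Choose an action that should yield desired_motion in the camera frame.
    Without context, the policy assumes the training-time setup (yaw = 0)."""
    theta = infer_view_angle(context) if context else 0.0
    return rotate(desired_motion, -theta)


def tracking_error(step, desired_motion, context=None):
    """Distance between the observed motion and the desired motion."""
    observed = step(policy(desired_motion, context))
    return math.dist(observed, desired_motion)


# Hypothetical novel setup: the camera is yawed by 60 degrees.
true_theta = math.radians(60)


def step(action):
    """Toy system: observed motion is the action rotated by the unknown yaw."""
    return rotate(action, true_theta)


context = collect_context(step)          # interaction phase, before the task
goal = (1.0, 0.0)                        # desired motion in the camera frame
baseline_error = tracking_error(step, goal)           # no context
adapted_error = tracking_error(step, goal, context)   # conditioned on context
```

In this toy setting the context-free policy misses the target by the full effect of the rotation, while the context-conditioned policy recovers the yaw from its own probes and compensates, with no parameter updates. A real VLA model would consume the interaction history as tokens rather than compute an explicit estimate.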