DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models

2026-03-17 · Robotics
AI summary

The authors present DreamPlan, a new method for improving robot manipulation by combining large vision-language models (VLMs) with learned video simulations. Instead of training robots directly in the real world, which is slow and risky, their approach uses videos generated by a model that learns how objects behave from exploratory data. This video "imagination" lets them fine-tune the robot planner safely and efficiently, helping it handle physical tasks such as deformable object manipulation without extensive real-world practice.

Robotic manipulation · Vision-Language Models (VLMs) · Reinforcement Learning (RL) · Video world models · Zero-shot planning · Action-conditioned video generation · Policy optimization · Physical grounding · Deformable object manipulation · Sample efficiency
Authors
Emily Yue-Ting Jia, Weiduo Yuan, Tianheng Shi, Vitor Guizilini, Jiageng Mao, Yue Wang
Abstract
Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics, directly fine-tuning VLMs via real-world interaction is prohibitively expensive, unsafe, and sample-inefficient. To overcome this bottleneck, we introduce DreamPlan, a novel framework for the reinforcement fine-tuning of VLM planners via video world models. Instead of relying on costly physical rollouts, DreamPlan first leverages the zero-shot VLM to collect exploratory interaction data. We demonstrate that this sub-optimal data is sufficient to train an action-conditioned video generation model, which implicitly captures complex real-world physics. Subsequently, the VLM planner is fine-tuned entirely within the "imagination" of this video world model using Odds Ratio Policy Optimization (ORPO). By utilizing these virtual rollouts, physical and task-specific knowledge is efficiently injected into the VLM. Our results indicate that DreamPlan bridges the gap between semantic reasoning and physical grounding, significantly improving manipulation success rates without the need for large-scale real-world data collection. Our project page is https://psi-lab.ai/DreamPlan/.
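The abstract names Odds Ratio Policy Optimization (ORPO) as the objective used to fine-tune the VLM planner on imagined rollouts. As background, a minimal sketch of the standard ORPO loss is shown below, operating on sequence-level average log-probabilities of a preferred and a dispreferred plan; the function name, scalar inputs, and the λ default are illustrative assumptions, not details from the paper.

```python
import math

def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """Illustrative ORPO objective on average per-token log-probs.

    logp_chosen / logp_rejected: average log-probability the policy
    assigns to the preferred / dispreferred plan (negative values).
    lam: weight on the odds-ratio term (hypothetical default).
    """
    def log_odds(logp: float) -> float:
        # odds(y) = p / (1 - p), with p = exp(logp)
        p = math.exp(logp)
        return logp - math.log(1.0 - p)

    # Odds-ratio term: increase the odds of the preferred plan
    # relative to the dispreferred one.
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)

    # Supervised NLL on the preferred plan plus the weighted OR penalty.
    return -logp_chosen + lam * l_or
```

In this formulation the supervised term anchors the policy to the preferred rollouts, while the odds-ratio term supplies the preference signal, so no separate reference model is needed.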