Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
2026-02-13 • Robotics
AI summary
The authors improve how robots leverage vision-language models (VLMs) to understand and perform tasks. Instead of giving robots only simple language commands, they train Steerable Policies that respond to detailed instructions at multiple levels of abstraction, such as specific motions or pixel locations. This makes the robot's actions more controllable and helps it handle novel or complex tasks. Across real-world manipulation experiments, the approach outperforms previous methods.
pretrained vision-language models, robot control, vision-language-action models, steerable policies, hierarchical control, low-level behavior, in-context learning, task generalization, robotic manipulation
Authors
William Chen, Jagdeep Singh Bhatia, Catherine Glossop, Nikhil Mathihalli, Ria Doshi, Andy Tang, Danny Driess, Karl Pertsch, Sergey Levine
Abstract
Pretrained vision-language models (VLMs) can make semantic and visual inferences across diverse settings, providing valuable common-sense priors for robotic control. However, effectively grounding this knowledge in robot behaviors remains an open challenge. Prior methods often employ a hierarchical approach where VLMs reason over high-level commands to be executed by separate low-level policies, e.g., vision-language-action models (VLAs). The interface between VLMs and VLAs is usually natural language task instructions, which fundamentally limits how much VLM reasoning can steer low-level behavior. We thus introduce Steerable Policies: VLAs trained on rich synthetic commands at various levels of abstraction, like subtasks, motions, and grounded pixel coordinates. By improving low-level controllability, Steerable Policies can unlock pretrained knowledge in VLMs, enabling improved task generalization. We demonstrate this benefit by controlling our Steerable Policies with both a learned high-level embodied reasoner and an off-the-shelf VLM prompted to reason over command abstractions via in-context learning. Across extensive real-world manipulation experiments, these two novel methods outperform prior embodied reasoning VLAs and VLM-based hierarchical baselines, including on challenging generalization and long-horizon tasks. Website: steerable-policies.github.io
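The interface described above, where a high-level VLM reasoner hands a low-level VLA commands at different abstraction levels (subtasks, motions, grounded pixel coordinates), can be illustrated with a minimal sketch. The command levels are taken from the abstract; all type names, fields, and the prompt serialization format are hypothetical illustrations, not the authors' actual implementation:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class CommandLevel(Enum):
    """Abstraction levels a steerable low-level policy can be conditioned on."""
    TASK = "task"          # full natural-language task instruction
    SUBTASK = "subtask"    # intermediate step, e.g. "pick up the sponge"
    MOTION = "motion"      # fine-grained motion, e.g. "move the gripper left"
    PIXEL = "pixel"        # grounded pixel-coordinate target in the image

@dataclass
class SteeringCommand:
    """One command a high-level VLM reasoner passes to the low-level VLA."""
    level: CommandLevel
    text: str
    pixel: Optional[Tuple[int, int]] = None  # (u, v) image coordinates, PIXEL level only

def format_prompt(obs_caption: str, cmd: SteeringCommand) -> str:
    """Serialize observation context and a command into one conditioning string."""
    parts = [f"observation: {obs_caption}", f"{cmd.level.value}: {cmd.text}"]
    if cmd.level is CommandLevel.PIXEL and cmd.pixel is not None:
        parts.append(f"target_pixel: {cmd.pixel[0]},{cmd.pixel[1]}")
    return " | ".join(parts)

# Example: a VLM decomposes a task into commands at mixed abstraction levels,
# each of which the steerable policy could consume directly.
plan = [
    SteeringCommand(CommandLevel.SUBTASK, "pick up the red block"),
    SteeringCommand(CommandLevel.MOTION, "move the gripper 5 cm to the left"),
    SteeringCommand(CommandLevel.PIXEL, "place the block here", pixel=(212, 148)),
]
prompts = [format_prompt("tabletop with red block and bowl", c) for c in plan]
```

The key design point this sketch conveys is that richer command types than plain task strings widen the channel through which VLM reasoning can steer low-level behavior.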