HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

2026-02-24 · Robotics
AI summary

The authors developed HALO, a new model for robots that combines understanding text, predicting future images, and deciding actions all in one system to better perform tasks. HALO uses a special transformer setup that splits different thinking steps but lets them work together smoothly. They created a way to teach HALO using automatically made training data. Their tests show HALO works better than previous methods, especially in new or complex environments. Overall, the authors improved how robots can plan and act by making their reasoning more like how humans think through problems step-by-step.

Vision-Language-Action (VLA) models · Embodied multimodal chain-of-thought (EM-CoT) · Transformer architecture · Visual subgoal prediction · Robotic manipulation · Multimodal reasoning · RoboTwin benchmark · Out-of-distribution generalization · Mixture-of-Transformers (MoT) · Automated training data synthesis
Authors
Quanxin Shou, Fangqi Zhu, Shawn Chen, Puxin Yan, Zhengyang Yan, Yikun Miao, Xiaoyi Pang, Zicong Hong, Ruikai Shi, Hao Huang, Jie Zhang, Song Guo
Abstract
Vision-Language-Action (VLA) models have shown strong performance in robotic manipulation, but they often struggle in long-horizon or out-of-distribution scenarios due to the lack of explicit mechanisms for multimodal reasoning and for anticipating how the world will evolve under action. Recent works introduce textual chain-of-thought or visual subgoal prediction within VLA models to support reasoning, but still fail to offer a unified, human-like reasoning framework for joint textual reasoning, visual foresight, and action prediction. To this end, we propose HALO, a unified VLA model that enables embodied multimodal chain-of-thought (EM-CoT) reasoning through a sequential process of textual task reasoning, visual subgoal prediction for fine-grained guidance, and EM-CoT-augmented action prediction. We instantiate HALO with a Mixture-of-Transformers (MoT) architecture that decouples semantic reasoning, visual foresight, and action prediction into specialized experts while allowing seamless cross-expert collaboration. To enable HALO to learn at scale, we introduce an automated pipeline for synthesizing EM-CoT training data, along with a carefully crafted training recipe. Extensive experiments demonstrate that: (1) HALO achieves superior performance in both simulated and real-world environments, surpassing the baseline policy pi_0 by 34.1% on the RoboTwin benchmark; (2) all proposed components of the training recipe and the EM-CoT design improve task success rate; and (3) HALO exhibits strong generalization under aggressive, unseen environmental randomization with our proposed EM-CoT reasoning.
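The sequential EM-CoT process described above (textual task reasoning, then visual subgoal prediction, then action prediction conditioned on both) can be sketched in code. This is a minimal illustrative skeleton with stand-in expert classes and toy return values; the class names, interfaces, and internals are assumptions for exposition, not the paper's actual MoT implementation.

```python
# Illustrative sketch of the three-stage EM-CoT pipeline from the abstract.
# Each "expert" is a placeholder for a specialized transformer expert in the
# MoT architecture; real experts would share attention for cross-expert
# collaboration rather than run as independent modules.

class ReasoningExpert:
    """Stage 1: produce a textual chain-of-thought plan from the task and observation."""
    def plan(self, instruction, observation):
        return f"plan for: {instruction}"  # toy stand-in for generated reasoning text

class ForesightExpert:
    """Stage 2: predict a visual subgoal (future image) conditioned on the plan."""
    def predict_subgoal(self, observation, plan):
        return {"subgoal_image": "<predicted future frame>", "plan": plan}

class ActionExpert:
    """Stage 3: predict low-level actions conditioned on the full EM-CoT context."""
    def predict_actions(self, observation, plan, subgoal):
        return [{"action": "move", "conditioned_on": (plan, subgoal["subgoal_image"])}]

def em_cot_step(instruction, observation):
    reasoner, foresight, actor = ReasoningExpert(), ForesightExpert(), ActionExpert()
    plan = reasoner.plan(instruction, observation)          # textual task reasoning
    subgoal = foresight.predict_subgoal(observation, plan)  # visual subgoal prediction
    return actor.predict_actions(observation, plan, subgoal)  # EM-CoT-augmented actions

actions = em_cot_step("stack the red block on the blue block", observation=None)
print(actions[0]["action"])
```

The point of the sketch is the dataflow: each later stage consumes the earlier stages' outputs, so the action prediction is grounded in both the textual plan and the anticipated visual subgoal rather than in the raw observation alone.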