HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
2026-04-15 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition · Artificial Intelligence · Robotics
AI summary
The authors present HiVLA, a new approach for robots that combines vision and language to improve manipulation tasks. Unlike typical models that mix understanding and control together, their method separates high-level planning from low-level actions. The planner uses a vision-language model to break down tasks and find objects, while a special action expert handles precise movements. This way, the robot keeps strong reasoning skills and can execute tasks more accurately, especially with complicated or small objects. Tests show HiVLA works better than existing models on difficult, multi-step tasks.
Vision-Language Models · Robotic Manipulation · Task Decomposition · Visual Grounding · Diffusion Transformer · Cross-Attention Mechanism · Zero-shot Reasoning · Long-horizon Skill Composition · Action Execution · Hierarchical Framework
Authors
Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo
Abstract
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. At the high level, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert at the low level, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
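The cascaded cross-attention described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: all module names, dimensions, and the residual/normalization layout are assumptions. It shows only the stated fusion order, with action tokens attending sequentially to global scene context, then high-resolution object-centric crop features, then a skill (subtask) embedding.

```python
import torch
import torch.nn as nn


class CascadedCrossAttentionBlock(nn.Module):
    """Hypothetical sketch of one cascaded cross-attention block.

    Action tokens (the DiT's flow-matching stream) cross-attend to three
    condition sources in sequence, as the abstract describes: global
    context -> object-centric crops -> skill semantics.
    """

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn_global = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_crop = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_skill = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, action_tokens, global_ctx, crop_feats, skill_emb):
        x = action_tokens
        stages = zip(
            (self.attn_global, self.attn_crop, self.attn_skill),
            self.norms,
            (global_ctx, crop_feats, skill_emb),
        )
        for attn, norm, kv in stages:
            # residual cross-attention: queries are action tokens,
            # keys/values come from the current condition source
            out, _ = attn(norm(x), kv, kv)
            x = x + out
        return x


# Usage with made-up token counts: 16 action tokens, 64 global scene
# tokens, 32 crop tokens, and a single skill embedding.
block = CascadedCrossAttentionBlock()
actions = torch.randn(2, 16, 256)
y = block(
    actions,
    torch.randn(2, 64, 256),  # global context tokens
    torch.randn(2, 32, 256),  # object-centric crop tokens
    torch.randn(2, 1, 256),   # skill/subtask embedding
)
print(y.shape)  # torch.Size([2, 16, 256])
```

Sequencing the three sources (rather than concatenating them into one key/value set) is what makes the fusion "cascaded": each stage refines the action tokens before the next, more task-specific condition is applied.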