UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception

2026-04-15

Robotics, Artificial Intelligence
AI summary

The authors present UMI-3D, an improved version of a wrist-mounted data-collection tool for robot manipulation. Unlike the original UMI, which relied only on cameras and struggled with occlusions and moving scenes, UMI-3D adds a small LiDAR sensor to track the environment in 3D. This combination produces more accurate and reliable demonstration data, which leads to better robot learning and higher success on hard tasks such as handling soft or jointed objects. The system remains portable and easy to use, and all hardware and software designs are shared openly to help other researchers.

Keywords
Universal Manipulation Interface (UMI), LiDAR, SLAM (Simultaneous Localization and Mapping), embodied manipulation, multimodal sensing, pose estimation, spatiotemporal calibration, visuomotor policy, deformable object manipulation, articulated object operation
Authors
Ziming Wang
Abstract
We present UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist-mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real-world environments. UMI-3D addresses these limitations by introducing a lightweight and low-cost LiDAR sensor tightly integrated into the wrist-mounted interface, enabling LiDAR-centric SLAM with accurate metric-scale pose estimation under challenging conditions. We further develop a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI-3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. Extensive real-world experiments demonstrate that UMI-3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision-only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end-to-end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open-sourced to facilitate large-scale data collection and accelerate research in embodied intelligence: https://umi-3d.github.io.
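The abstract describes a spatiotemporal calibration that aligns camera images with LiDAR point clouds. As an illustration only, the sketch below shows the two generic steps such alignment involves: temporal matching of a scan to the nearest hardware-synchronized image, and spatial projection of LiDAR points into the image via camera-from-LiDAR extrinsics and a pinhole model. All numeric values (`K`, `T_cam_lidar`, the timestamps) are invented for the example and are not the paper's calibration results.

```python
import numpy as np

# Hypothetical camera intrinsics and camera-from-LiDAR extrinsics; real values
# would come from the system's calibration procedure, not from here.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])
T_cam_lidar = np.eye(4)
T_cam_lidar[:3, 3] = [0.05, 0.0, 0.02]  # assumed lever arm between sensors (m)

def nearest_frame(image_stamps, scan_stamp):
    """Temporal alignment: index of the image whose timestamp is closest
    to the LiDAR scan's timestamp."""
    return int(np.argmin(np.abs(np.asarray(image_stamps) - scan_stamp)))

def project_points(points_lidar, K, T_cam_lidar):
    """Spatial alignment: transform LiDAR points into the camera frame and
    project them to pixel coordinates with a pinhole model."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0              # discard points behind the camera
    uvw = (K @ pts_cam[in_front].T).T
    return uvw[:, :2] / uvw[:, 2:3]           # perspective divide

# Usage: a scan at t=0.031 s matches the second image; a point one metre
# ahead of the LiDAR projects near the image's principal point.
idx = nearest_frame([0.000, 0.033, 0.066], scan_stamp=0.031)
uv = project_points(np.array([[0.0, 0.0, 1.0]]), K, T_cam_lidar)
```

The nearest-timestamp step is a deliberate simplification; with hardware triggering, image and scan clocks share a time base, so interpolation of poses between frames (as a full SLAM pipeline would do) is not shown here.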