LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
2026-04-09 • Computer Vision and Pattern Recognition
AI summary
The authors address how robots can better handle new and unseen tasks involving objects by improving their understanding of 3D relationships between things. They point out that existing AI models struggle because they mostly use 2D information and language, which isn't enough for detailed manipulation. Their method, LAMP, uses clues from image editing to create 3D spatial guides that help the robot understand how objects relate in space. Experiments show their approach helps robots perform new tasks more accurately without prior training on those tasks.
robotic manipulation, generalization, 3D transformations, image editing, spatial relations, large language models, vision-language models, zero-shot learning, geometry-aware representations, open-world tasks
Authors
Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun, Guofeng Zhang
Abstract
Human-like generalization in open-world settings remains a fundamental challenge for robotic manipulation. Existing learning-based methods, including reinforcement learning, imitation learning, and vision-language-action models (VLAs), often struggle with novel tasks and unseen environments. Another promising direction is to explore generalizable representations that capture fine-grained spatial and geometric relations for open-world manipulation. While large language models (LLMs) and vision-language models (VLMs) provide strong semantic reasoning based on language or annotated 2D representations, their limited 3D awareness restricts their applicability to fine-grained manipulation. To address this, we propose LAMP, which lifts image-editing as 3D priors to extract inter-object 3D transformations as continuous, geometry-aware representations. Our key insight is that image-editing inherently encodes rich 2D spatial cues, and lifting these implicit cues into 3D transformations provides fine-grained and accurate guidance for open-world manipulation. Extensive experiments demonstrate that LAMP delivers precise 3D transformations and achieves strong zero-shot generalization in open-world manipulation. Project page: https://zju3dv.github.io/LAMP/.
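The abstract does not say how 2D cues are lifted into a 3D transformation. The sketch below is only a generic illustration of the idea, not LAMP's pipeline. It assumes matched 2D keypoints on an object before and after an edit, a metric depth value for each keypoint, and known pinhole intrinsics. The function names `backproject` and `rigid_transform`, and the parameters `fx`, `fy`, `cx`, `cy`, are hypothetical. The rigid fit uses the standard Kabsch (SVD) solution.

```python
import numpy as np


def backproject(pixels, depths, fx, fy, cx, cy):
    """Lift (u, v) pixels with metric depth to 3D camera-frame points.

    Assumes an ideal pinhole camera; the intrinsics are hypothetical inputs.
    """
    pixels = np.asarray(pixels, dtype=float)
    depths = np.asarray(depths, dtype=float)
    x = (pixels[:, 0] - cx) * depths / fx
    y = (pixels[:, 1] - cy) * depths / fy
    return np.stack([x, y, depths], axis=1)


def rigid_transform(src, dst):
    """Least-squares R, t such that dst ~= src @ R.T + t (Kabsch algorithm)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


if __name__ == "__main__":
    # Synthetic check: rotate 30 degrees about z and shift, then recover.
    theta = np.pi / 6
    R_true = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta),  np.cos(theta), 0.0],
        [0.0,            0.0,           1.0],
    ])
    t_true = np.array([0.1, -0.2, 0.05])
    before = np.array([[0, 0, 1], [1, 0, 1.5], [0, 1, 2], [1, 1, 1.2]], dtype=float)
    after = before @ R_true.T + t_true
    R, t = rigid_transform(before, after)
    print(np.allclose(R, R_true), np.allclose(t, t_true))
```

In this toy setting the recovered `R` and `t` form a continuous 6-DoF quantity, which is the kind of geometry-aware representation the abstract contrasts with purely 2D or language-based guidance. Real data would also need robust correspondence matching and outlier handling, neither of which is shown here.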