GROW$^2$: Grounding Which and Where for Robot Tool Use
2026-06-29 • Robotics
Robotics • Artificial Intelligence • Computer Vision and Pattern Recognition
AI summary
The authors created GROW², a system that helps robots figure out how to use everyday objects as tools, even if those objects aren't usually meant for the job. It first interprets the task instruction and chooses a suitable object and its relevant parts using the commonsense reasoning of Vision-Language Models, then localizes those parts as 3D regions from a single RGB-D image. This two-level approach avoids data-heavy, end-to-end training and generalizes zero-shot to open-category objects. Their tests show that GROW² outperforms state-of-the-art baselines on affordance prediction benchmarks and in both simulated and real-world robot tool use experiments.
robot tool use, affordance grounding, Vision-Language Models, RGB-D image, object parts, 3D region localization, zero-shot generalization, foundation models, semantic parsing, geometric grounding
Authors
Yuhong Deng, Yuyao Liu, David Hsu
Abstract
Can a robot use a plate to cut a cake if no knife is available? Tool use greatly expands robot capabilities, but to use tools creatively beyond their intended functions, the robot faces the challenge of $\textit{open-world affordance grounding}$: selecting an open-category object to act as a tool and localizing its specific region of action. To this end, we introduce GROW$^2$ (GROunding Which and Where), which leverages object parts as a natural abstraction to split the grounding process hierarchically into semantic and geometric levels, thus bypassing the need for data-heavy, end-to-end training. Semantically, GROW$^2$ harnesses the commonsense reasoning of Vision-Language Models (VLMs) to parse a natural-language task instruction, select a suitable object as the tool, and identify task-relevant parts on the tool and the target object. Geometrically, vision foundation models then ground the selected parts into precise 3D regions from a single RGB-D image. Experiments show that GROW$^2$ outperforms state-of-the-art baselines on established affordance prediction benchmarks. Further, it achieves zero-shot generalization over open-category objects and outperforms baselines in both simulated and real-world robot tool use experiments.
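The two-level split described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify model interfaces, so `select_parts` (standing in for the VLM) and `segment_part` (standing in for a vision foundation model) are hypothetical callables supplied by the caller, and `PartSelection`, `Intrinsics`, and `ground_which_and_where` are names invented for illustration. The only concrete computation is the standard pinhole back-projection that lifts masked depth pixels into camera-frame 3D points, which is one plausible way to turn a 2D part mask into a 3D region.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Mask = Sequence[Sequence[bool]]
Depth = Sequence[Sequence[float]]


@dataclass(frozen=True)
class PartSelection:
    """Semantic-level output: which object and which part (hypothetical schema)."""
    tool_object: str
    tool_part: str
    target_object: str
    target_part: str


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject_mask(depth: Depth, mask: Mask, k: Intrinsics) -> List[Point3D]:
    """Lift masked pixels with valid (positive) depth to camera-frame 3D points."""
    points: List[Point3D] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            if keep and z > 0.0:
                x = (u - k.cx) * z / k.fx
                y = (v - k.cy) * z / k.fy
                points.append((x, y, z))
    return points


def ground_which_and_where(
    instruction: str,
    select_parts: Callable[[str], PartSelection],  # assumption: wraps a VLM query
    segment_part: Callable[[str, str], Mask],      # assumption: wraps a segmentation model
    depth: Depth,
    k: Intrinsics,
) -> Tuple[PartSelection, List[Point3D], List[Point3D]]:
    """Semantic level picks objects and parts; geometric level grounds them in 3D."""
    selection = select_parts(instruction)
    tool_mask = segment_part(selection.tool_object, selection.tool_part)
    target_mask = segment_part(selection.target_object, selection.target_part)
    tool_region = backproject_mask(depth, tool_mask, k)
    target_region = backproject_mask(depth, target_mask, k)
    return selection, tool_region, target_region
```

Keeping the two levels behind separate callables mirrors the hierarchy in the abstract: the semantic stage can be swapped without touching the geometric stage, and the part names are the only interface between them.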