RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

2026-05-28 • Robotics

Robotics, Artificial Intelligence

AI summary

The authors created RoboWits, a benchmark that tests a robot's ability to reason, adapt, and use tools creatively when things don't go as expected. They built an automated pipeline that generates many different tasks to challenge robots in reasoning and problem-solving. When testing popular robot policies, pre-trained VLAs, and oracle-state planners, they found that pre-trained VLAs show preliminary success on the original seed tasks after fine-tuning but struggle on the mutated versions, suggesting they are not yet flexible or robust. The benchmark highlights where robots need to improve in real-world problem-solving.

robotic benchmarks, cognitive reasoning, creative tool use, task generation, manipulation tasks, visual-language agents, multi-agent systems, task mutation, robustness, oracle-state planners

Authors

Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen, Thomas L. Griffiths, Yejin Choi, David Held, Chuang Gan

Abstract

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bimanual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality, reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained vision-language-action models (VLAs), and oracle-state planners. Our results reveal a significant performance gap: while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle on mutated tasks, indicating brittleness in manipulation tasks that require reasoning, strategy adaptation, and robustness to deceptive or constrained environments. The project page is available at https://umass-embodied-agi.github.io/RoboWits.
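
The stages named in the abstract (seed task generation and verification, metric generation, scene generation, task mutation) could be chained roughly as in the sketch below. This is only an illustration under stated assumptions: the abstract does not describe the agents' interfaces, so every class, function, and mutation name here is hypothetical, and plain deterministic functions stand in for what are presumably model-driven agents.

```python
from dataclasses import dataclass, replace

# Reasoning categories listed in the abstract.
CATEGORIES = {"geometry", "material", "assembly"}


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; not the benchmark's actual schema."""
    name: str
    category: str
    metric: str = ""
    scene: str = ""
    mutations: tuple = ()


def generate_seed(name: str, category: str) -> Task:
    # Stand-in for the seed task generation agent.
    return Task(name=name, category=category)


def verify_seed(task: Task) -> bool:
    # Stand-in for the verification agent: accept only known categories.
    return bool(task.name) and task.category in CATEGORIES


def attach_metric(task: Task) -> Task:
    # Stand-in for the metric generation agent.
    return replace(task, metric=f"success({task.name})")


def build_scene(task: Task) -> Task:
    # Stand-in for the scene generation agent.
    return replace(task, scene=f"scene[{task.name}]")


def mutate(task: Task, changes: list) -> list:
    # Stand-in for the task mutation agent: one variant per change.
    return [replace(task, mutations=task.mutations + (c,)) for c in changes]


def run_pipeline(seeds: list, changes: list) -> list:
    """Return each verified seed task followed by its mutated variants."""
    tasks = []
    for name, category in seeds:
        seed = generate_seed(name, category)
        if not verify_seed(seed):
            continue  # rejected seeds never reach the later stages
        seed = build_scene(attach_metric(seed))
        tasks.append(seed)
        tasks.extend(mutate(seed, changes))
    return tasks


if __name__ == "__main__":
    # Seed and mutation names are made up for illustration.
    seeds = [("open_box", "geometry"), ("bad_seed", "cooking")]
    for t in run_pipeline(seeds, ["remove_tool", "swap_material"]):
        print(t.name, t.mutations)
```

Keeping each stage as a separate function mirrors the abstract's separation of agents, so any single stage could be swapped for a model-backed implementation without touching the others.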
