FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

2026-07-01 | Robotics

Robotics, Artificial Intelligence
AI summary

The authors studied how to teach robots with two arms to assemble full-size furniture using models that combine vision, language, and action. They created a new simulated environment and a VR system for collecting real-world demonstrations of furniture assembly. Their model tracks its progress through each subtask of a long assembly, which lets it move to the next subtask automatically and reduces compounding errors. Their method improved success rates in simulation and worked well on a real robot, with only a 16% drop in performance on the hardest task.

bimanual manipulation, furniture assembly, vision-language-action models, long-horizon tasks, VR teleoperation, robot control, simulation pipeline, progress prediction, Kinova Gen3 robot
Authors
Chenyang Ma, Yue Yang, Radu Corcodel, Siddarth Jain, Andrew Wu, Chiori Hori, Diego Romeres
Abstract
Current work on robot furniture assembly mostly focuses on toy-scale settings or single-arm manipulation. We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs). We formalize the task, develop a scalable simulation pipeline for expert data generation and evaluation, and build a VR teleoperation system for single-operator bimanual control to collect high-quality real-world demonstrations. To address extreme long-horizon assembly with up to 7 subtasks and 1550 control steps, we propose a progress-enhanced VLA, finetuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal, enabling automatic subtask transitions and reducing compounding errors during inference. We further study perception and control design factors that critically affect precision in real-scale assembly. FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types, with an additional 21% gain from our design factor study. We validate on a real Kinova Gen3 platform with only a 16% drop on the hardest task.
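
The progress-driven subtask transition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the policy interface, the function and parameter names, and the 0.95 completion threshold are all assumptions made for illustration. Only the idea of jointly predicting an action and a continuous progress value, and the 1550-step horizon, come from the abstract.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical policy interface (an assumption, not the paper's API):
# (observation, subtask instruction) -> (action, progress in [0, 1]).
Policy = Callable[[object, str], Tuple[Sequence[float], float]]


def run_assembly(
    policy: Policy,
    subtasks: Sequence[str],
    get_obs: Callable[[], object],
    apply_action: Callable[[Sequence[float]], None],
    max_steps: int = 1550,          # horizon quoted in the abstract
    done_threshold: float = 0.95,   # assumed value, chosen for illustration
) -> int:
    """Run subtasks in order and return how many were completed.

    The loop advances to the next subtask as soon as the predicted
    progress for the current one reaches `done_threshold`.
    """
    idx = 0
    for _ in range(max_steps):
        if idx >= len(subtasks):
            break  # every subtask has been completed
        action, progress = policy(get_obs(), subtasks[idx])
        apply_action(action)
        if progress >= done_threshold:
            idx += 1  # automatic transition to the next subtask
    return idx
```

In this sketch a single threshold on the predicted progress decides each transition. A real system might instead smooth the signal over several steps or require it to stay above the threshold for a short window, but the abstract does not specify this.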