V-CAGE: Vision-Closed-Loop Agentic Generation Engine for Robotic Manipulation

2026-04-10 · Robotics

AI summary

The authors developed V-CAGE, a system that generates robot training data from scenes that make sense and in which every target is physically within the robot's reach. Instead of following fixed scripts, V-CAGE acts as an intelligent agent that plans scene layouts and then visually checks its own results, so that failed attempts are filtered out before they enter the dataset. It also compresses the resulting video files by more than 90% without hurting the training of robot models on them. Together, these steps make it easier to produce large amounts of useful data for training robot vision and manipulation.

Vision-Language-Action models, scene generation, semantic coherence, kinematic reachability, foundation models, Inpainting-Guided Scene Construction, closed-loop verification, functional metadata, perceptual compression, robotic manipulation datasets
Authors
Yaru Liu, Ao-bo Wang, Nanyang Ye
Abstract
Scaling Vision-Language-Action (VLA) models requires massive datasets that are both semantically coherent and physically feasible. However, existing scene generation methods often lack context awareness, which makes it difficult to synthesize high-fidelity environments with rich semantic information and frequently yields unreachable target positions that cause tasks to fail prematurely. We present V-CAGE (Vision-Closed-loop Agentic Generation Engine), an agentic framework for autonomous robotic data synthesis. Unlike traditional scripted pipelines, V-CAGE operates as an embodied agentic system, leveraging foundation models to bridge high-level semantic reasoning with low-level physical interaction. Specifically, we introduce Inpainting-Guided Scene Construction to systematically arrange context-aware layouts, ensuring that the generated scenes are both semantically structured and kinematically reachable. To ensure trajectory correctness, we integrate functional metadata with a closed-loop verification mechanism based on a Vision-Language Model (VLM), which acts as a visual critic to rigorously filter out silent failures and sever the error propagation chain. Finally, to overcome the storage bottleneck of massive video datasets, we implement a perceptually driven compression algorithm that achieves over 90% file size reduction without compromising downstream VLA training efficacy. By centralizing semantic layout planning and visual self-verification, V-CAGE automates the end-to-end pipeline, enabling the highly scalable synthesis of diverse, high-quality robotic manipulation datasets.
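
The abstract describes a generate-then-verify loop: a layout is proposed, checked for reachability, executed, and kept only if a visual critic accepts the result. The following Python sketch illustrates that control flow only; it is not the authors' implementation. Every name in it (`Episode`, `is_reachable`, `synthesize`) and the 0.85 m reach limit are hypothetical. The spherical workspace test is a crude stand-in for a real inverse-kinematics check, and the `critic` callable stands in for a VLM-based verifier.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class Episode:
    """One recorded attempt (hypothetical container, not from the paper)."""
    target: Point
    frames: List[str]  # stand-in for rendered video frames


def is_reachable(target: Point,
                 base: Point = (0.0, 0.0, 0.0),
                 max_reach: float = 0.85) -> bool:
    """Crude spherical-workspace test; a real system would query an IK solver."""
    return math.dist(target, base) <= max_reach


def synthesize(propose_target: Callable[[], Point],
               rollout: Callable[[Point], Episode],
               critic: Callable[[Episode], bool],
               n_episodes: int,
               max_attempts: int = 100) -> List[Episode]:
    """Collect episodes that pass both the reachability and the critic checks."""
    kept: List[Episode] = []
    attempts = 0
    while len(kept) < n_episodes and attempts < max_attempts:
        attempts += 1
        target = propose_target()
        if not is_reachable(target):
            continue  # reject the layout before spending simulation time
        episode = rollout(target)
        if critic(episode):  # assumed to return True only on visible success
            kept.append(episode)
    return kept
```

In this sketch the reachability test runs before the rollout, so infeasible layouts cost nothing to discard, while the critic runs after it, so episodes that executed without raising an error but did not achieve the goal (the "silent failures" of the abstract) never reach the dataset.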