AI summary
The authors explore how robots can better follow natural language instructions by verifying candidate actions at test time, which narrows the gap between what the robot intends and what it actually does. They find that jointly generating many rephrased instructions and candidate actions improves the chance of recovering the correct action. To exploit this, they develop CoVer, a contrastive verifier that checks alignment between language and actions and whose performance scales with additional data and compute. Their method delivers substantial improvements in both simulated and real-world robot tasks compared to simply training the policy longer. Overall, they contribute a scalable way for robots to double-check and select better actions based on language instructions.
Keywords
Vision-Language-Action (VLA) models, natural language instructions, test-time verification, scaling laws, contrastive verifier, Vision-Language-Model (VLM), SIMPLER benchmark, PolaRiS benchmark, robot instruction following, hierarchical verification
Authors
Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling laws for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling either dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
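To make the inference pipeline from the abstract concrete, here is a minimal sketch of verification-based selection under joint scaling of rephrases and action samples. This is not the authors' implementation: `vlm.rephrase`, `policy.sample_actions`, and `verifier.score` are hypothetical stand-ins for the VLM rephraser, the VLA policy, and the CoVer verifier, and the flat argmax below is a simplification of the paper's hierarchical prompt-then-chunk selection.

```python
"""Sketch of verification-based action selection for a VLA policy.

Assumptions (not from the paper): vlm, policy, and verifier are objects
exposing rephrase(), sample_actions(), and score() respectively.
"""

from dataclasses import dataclass
from typing import Any, List


@dataclass
class Candidate:
    instruction: str        # a rephrasing of the original command
    action_chunk: Any       # a low-level action chunk from the VLA policy
    score: float            # verifier's vision-language-action alignment score


def verify_and_select(image: Any, instruction: str, vlm: Any, policy: Any,
                      verifier: Any, n_rephrases: int = 8,
                      n_actions: int = 4) -> Candidate:
    """Jointly scale rephrased instructions and sampled action chunks,
    then return the candidate the verifier scores highest."""
    # "Boot-time compute": rephrasings depend only on the instruction, so
    # in practice they can be precomputed once at deployment and cached.
    rephrases: List[str] = [instruction] + [
        vlm.rephrase(instruction) for _ in range(n_rephrases - 1)
    ]

    candidates: List[Candidate] = []
    for text in rephrases:
        # Repeatedly sample low-level action chunks for each high-level prompt.
        for chunk in policy.sample_actions(image, text, num_samples=n_actions):
            # The verifier scores alignment between the observation, the
            # instruction, and the proposed action chunk.
            s = verifier.score(image, text, chunk)
            candidates.append(Candidate(text, chunk, s))

    # Execute the action chunk from the best-aligned (instruction, action) pair.
    return max(candidates, key=lambda c: c.score)
```

In a hierarchical variant, one would first use the verifier to pick the best high-level prompt, then re-score only that prompt's action chunks, trading a small amount of diversity for a large reduction in per-step verification cost.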