DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

2026-06-10 • Robotics

Robotics • Artificial Intelligence • Computer Vision and Pattern Recognition

AI summary

The authors study how Vision-Language Models (VLMs) used as high-level robot planners become slower and more expensive as test-time compute is scaled up, without always getting better at tasks. They propose DIRECT, a routing framework that decides how much compute to spend on each prompt based on the scene. Their tests show that routing compute this way can match or improve success rates while reducing latency and cost. This approach helps make advanced robot planning more practical in the real world.

Vision-Language Models, Embodied Agents, Test-time Compute, Chain-of-Thought, Model Scaling, Compute Routing, Robot Manipulation, Latency, FLOPs, Pareto Frontier

Authors

Jadelynn Dao, Milan Ganai, Yasmina Abukhadra, Ajay Sridhar, Mozhgan Nasr Azadani, Katie Luo, Clark Barrett, Jiajun Wu, Chelsea Finn, Marco Pavone

Abstract

Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling test-time compute to improve capability. However, we observe that doing so increases latency, token usage, and FLOPs while yielding uneven, often diminishing gains in downstream success, limiting where embodied agents can be deployed. We argue that choosing when and where to spend test-time compute is central to bringing frontier performance to the real world. We introduce DIRECT, a routing framework that uses multimodal scene context to allocate compute per prompt, improving the success-cost Pareto frontier over fixed model selection. Across three dominant scaling axes, namely chain-of-thought depth, model size, and memory history, our experiments on VLABench and RoboMME show that test-time compute is not a uniform lever: different axes yield qualitatively distinct capability gains. We validate these insights on a physical Franka arm in a DROID setup spanning zero-shot manipulation and long-horizon chaining, where our router matches or exceeds a stronger model's success rate at up to 65% lower average latency. Ultimately, our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level embodied planning in robotic systems at a fraction of the cost. The project page is at jadee-dao.github.io/direct/.
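
For readers unfamiliar with the term, the success-cost Pareto frontier mentioned above can be illustrated with a minimal sketch. This is not the authors' code: the `Config` type, the `pareto_frontier` helper, and every number below are hypothetical and chosen only for illustration. A configuration is on the frontier if no other configuration is both at least as cheap and at least as successful.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One planner configuration (hypothetical fields, for illustration)."""
    name: str
    latency_s: float  # average cost per prompt; lower is better
    success: float    # task success rate in [0, 1]; higher is better


def pareto_frontier(configs: List[Config]) -> List[Config]:
    """Return the configurations not dominated in (latency, success)."""
    frontier: List[Config] = []
    best_success = float("-inf")
    # Visit cheapest first; among equal latency, highest success first.
    for c in sorted(configs, key=lambda c: (c.latency_s, -c.success)):
        # Keep c only if it beats every configuration that costs no more.
        if c.success > best_success:
            frontier.append(c)
            best_success = c.success
    return frontier


# Made-up numbers, NOT results from the paper.
configs = [
    Config("A", latency_s=1.0, success=0.55),
    Config("B", latency_s=2.5, success=0.62),
    Config("C", latency_s=6.0, success=0.70),
    Config("D", latency_s=9.0, success=0.69),  # dominated by C
]

frontier_names = [c.name for c in pareto_frontier(configs)]
```

In this toy example, D is slower and less successful than C, so it falls off the frontier. "Improving the frontier" would correspond to adding a configuration that reaches a given success rate at lower latency than any existing point.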
