SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
2026-03-30 • Robotics
Robotics · Computation and Language · Computer Vision and Pattern Recognition
AI summary
The authors present SOLE-R1, a video-language reasoning model that supervises robot learning using only the robot's own video observations and a natural-language goal description. Unlike prior evaluator models, which can be misled by what they see and allow robots to exploit those perceptual errors, SOLE-R1 reasons step by step over its observations across time to judge task progress and assign rewards. It was trained on a large corpus of synthesized video trajectories with a method that combines step-by-step reasoning supervision and reinforcement learning. In experiments, SOLE-R1 enabled robots to learn new manipulation tasks from scratch without extra supervision, outperformed other state-of-the-art models, and proved more resistant to errors in reward evaluation.
Vision-language models · Reinforcement learning · Partial observability · Distribution shift · Video-language reasoning · Chain-of-thought reasoning · Reward signal · Zero-shot learning · Reward hacking · Manipulation tasks
Authors
Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza
Abstract
Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning data, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.
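As a rough illustration of the interface the abstract describes, the sketch below shows how per-timestep progress estimates from a video-language rewarder could be consumed in an online RL rollout. This is a minimal sketch under stated assumptions, not the authors' implementation: `query_progress_model`, `env_step`, and `sample_action` are hypothetical placeholders, since no public SOLE-R1 API is described here.

```python
# Minimal sketch of the reward interface the abstract describes: a
# video-language model scores per-timestep task progress from raw frames
# and a natural-language goal, and that score is the only reward signal
# the RL agent receives.
from typing import Callable, List, Sequence

import numpy as np

Frame = np.ndarray  # H x W x 3 RGB observation


def query_progress_model(frames: Sequence[Frame], goal: str) -> float:
    """Hypothetical stand-in for SOLE-R1 inference: returns an estimated
    task progress in [0, 1] after chain-of-thought reasoning over the
    frame history and the goal description."""
    raise NotImplementedError("replace with a real progress-model call")


def rollout_with_vlm_reward(
    env_step: Callable[[np.ndarray], Frame],
    sample_action: Callable[[], np.ndarray],
    goal: str,
    horizon: int = 50,
) -> List[float]:
    """Collect one episode, using progress estimates directly as rewards."""
    frames: List[Frame] = []
    rewards: List[float] = []
    for _ in range(horizon):
        frame = env_step(sample_action())
        frames.append(frame)
        # Dense reward: the model's progress estimate for the trajectory
        # so far. A shaped variant would instead reward the progress
        # increment, p_t - p_{t-1}.
        rewards.append(query_progress_model(frames, goal))
    return rewards
```

The key design point reflected here is that the rewarder sees only the observation history and the goal string, matching the abstract's claim that no ground-truth rewards, success indicators, or demonstrations are required.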