Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs

2026-02-24 | Machine Learning

Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Robotics
AI summary

The authors developed a method called Reflective Test-Time Planning to help robots learn from their mistakes more like humans do. Their approach lets robots weigh several possible actions before acting (reflection-in-action) and learn from the outcome of their actions afterward (reflection-on-action). It also lets the robot look back and revise its assessment of earlier decisions, which improves long-horizon planning. Experiments on two new benchmarks showed significant gains over baseline models, and qualitative real-robot trials showed the robot correcting its behavior through reflection. Ablation studies confirm that reflecting before and after acting play complementary roles.
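As a rough illustration of the "weigh several actions before acting" step, the sketch below shows generic best-of-N selection: sample several candidate actions and keep the one an internal scorer rates highest. It is not the authors' implementation. `reflect_in_action`, `toy_propose`, and the word-overlap `toy_score` are hypothetical stand-ins for the LLM action policy and its internal reflection model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ScoredAction:
    action: str
    score: float


def reflect_in_action(
    state: str,
    propose: Callable[[str, int], Sequence[str]],
    reflect_score: Callable[[str, str], float],
    num_candidates: int = 4,
) -> ScoredAction:
    """Sample candidate actions, score each with an internal reflection
    function, and return the highest-scoring one before anything is executed."""
    candidates = propose(state, num_candidates)
    if not candidates:
        raise ValueError("proposer returned no candidates")
    scored: List[ScoredAction] = [
        ScoredAction(a, reflect_score(state, a)) for a in candidates
    ]
    # max() keeps the first candidate if several share the top score.
    return max(scored, key=lambda s: s.score)


# Hypothetical proposer: a fixed candidate list in place of LLM sampling.
def toy_propose(state: str, n: int) -> List[str]:
    return ["open drawer", "pick cup", "place cup in cupboard", "wipe table"][:n]


# Hypothetical scorer: counts words shared with the goal, in place of an
# LLM-generated internal reflection.
def toy_score(state: str, action: str) -> float:
    return float(len(set(state.split()) & set(action.split())))


best = reflect_in_action("put cup in cupboard", toy_propose, toy_score)
print(best.action, best.score)
```

Spending more compute at this step means raising `num_candidates`, which is one simple form of test-time scaling; the policy itself is not changed.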

Embodied LLMs, reflection-in-action, reflection-on-action, test-time training, long-horizon planning, retrospective reflection, robot learning, benchmarking, MuJoCo, behavioral correction
Authors
Yining Hong, Huang Huang, Manling Li, Li Fei-Fei, Jiajun Wu, Yejin Choi
Abstract
Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a sequence of independent trials where mistakes repeat rather than accumulate into experience. Inspired by human reflective practitioners, we introduce Reflective Test-Time Planning, which integrates two modes of reflection: reflection-in-action, where the agent uses test-time scaling to generate multiple candidate actions and score them with internal reflections before execution; and reflection-on-action, where the agent uses test-time training to update both its internal reflection model and its action policy based on external reflections after execution. We also include retrospective reflection, which allows the agent to re-evaluate earlier decisions and update its models with hindsight for proper long-horizon credit assignment. Experiments on our newly designed Long-Horizon Household benchmark and MuJoCo Cupboard Fitting benchmark show significant gains over baseline models, with ablation studies validating the complementary roles of reflection-in-action and reflection-on-action. Qualitative analyses, including real-robot trials, highlight behavioral correction through reflection.
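The after-execution update and the retrospective pass can be illustrated with a toy tabular agent. This is a minimal sketch under stated assumptions, not the paper's method. A dictionary of action preferences stands in for the action policy. A dictionary of scores stands in for the internal reflection model. A scalar feedback value stands in for external reflection. Discounted hindsight returns stand in for long-horizon credit assignment. `ToyReflectiveAgent`, `lr`, and `gamma` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (state, action, immediate external feedback) for one executed step.
Step = Tuple[str, str, float]


class ToyReflectiveAgent:
    """Hypothetical tabular stand-in: `policy` holds action preferences and
    `critic` plays the role of the internal reflection model's scores."""

    def __init__(self, lr: float = 0.5, gamma: float = 0.9) -> None:
        self.lr = lr
        self.gamma = gamma
        self.policy: Dict[Tuple[str, str], float] = defaultdict(float)
        self.critic: Dict[Tuple[str, str], float] = defaultdict(float)

    def reflect_on_action(self, state: str, action: str, feedback: float) -> None:
        """Update both tables right after one action, from external feedback."""
        key = (state, action)
        self.critic[key] += self.lr * (feedback - self.critic[key])
        self.policy[key] += self.lr * feedback

    def retrospective_reflection(self, episode: List[Step]) -> List[float]:
        """Hindsight pass: compute discounted returns so a late success or
        failure is credited back to the earlier steps that led to it."""
        returns: List[float] = []
        g = 0.0
        for _, _, r in reversed(episode):
            g = r + self.gamma * g
            returns.append(g)
        returns.reverse()
        for (state, action, _), g_t in zip(episode, returns):
            key = (state, action)
            # How much better the step turned out than the critic expected.
            advantage = g_t - self.critic[key]
            self.critic[key] += self.lr * advantage
            self.policy[key] += self.lr * advantage
        return returns


episode: List[Step] = [
    ("kitchen", "open cupboard", 0.0),
    ("cupboard open", "insert plate", 0.0),
    ("plate inside", "close door", 1.0),
]
agent = ToyReflectiveAgent(lr=0.5, gamma=0.9)
for s, a, r in episode:
    agent.reflect_on_action(s, a, r)
returns = agent.retrospective_reflection(episode)
print(returns, dict(agent.policy))
```

In this toy episode the first step receives no immediate feedback, so the per-step update alone leaves its preference at zero. The retrospective pass then credits it with part of the final success. A real embodied LLM would instead apply gradient updates to model weights at test time.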