Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

2026-07-02 · Computation and Language

Computation and Language, Computer Vision and Pattern Recognition
AI summary

The authors study how large vision-language models can improve by revisiting their own answers, a process called self-reflection. They find that these models often fail to attend to visual details when trying to fix earlier mistakes, especially on unfamiliar (out-of-distribution) images. To address this, they propose a reinforcement learning training method, VRRL, that randomly masks the beginning of a reasoning trajectory during training and draws on an experience replay buffer to expose the model to varied failure states it must learn to correct. The approach improves average accuracy on out-of-distribution images over standard RL and reflection-oriented fine-tuning baselines. It is evaluated on visual grounding tasks involving tables and charts and on spatial navigation benchmarks.

vision-language models, chain of thought, self-reflection, reinforcement learning, experience replay, out-of-distribution, visual grounding, trajectory masking, spatial navigation
Authors
Liyan Tang, Fangcong Yin, Greg Durrett
Abstract
Large vision-language models (LVLMs) can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose VRRL, a novel reinforcement learning training framework with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovering from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as on spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade sharply under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.
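The two components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how masking or roll-ins are realized, so the sketch assumes that "masking" a prefix means excluding its tokens from the training loss, and that a "buffered roll-in" means resuming generation from a truncated failed trajectory stored earlier. The names `prefix_loss_mask`, `FailureBuffer`, `max_frac`, and the rule "reward of 0 means failure" are all hypothetical.

```python
import random
from collections import deque


def prefix_loss_mask(num_tokens, rng, max_frac=0.5):
    """Return a 0/1 mask that excludes a random-length prefix from the loss.

    Assumption: "masking" means prefix tokens contribute no gradient.
    """
    max_prefix = int(num_tokens * max_frac)
    k = rng.randint(0, max_prefix)  # randint is inclusive on both ends
    return [0] * k + [1] * (num_tokens - k)


class FailureBuffer:
    """Fixed-size store of trajectories that ended in a wrong answer."""

    def __init__(self, capacity):
        # A bounded deque evicts the oldest entry once capacity is reached.
        self.items = deque(maxlen=capacity)

    def add(self, tokens, reward):
        # Hypothetical rule: keep only non-empty failed trajectories.
        if reward == 0 and tokens:
            self.items.append(list(tokens))

    def sample_roll_in(self, rng):
        """Return a truncated failed trajectory to continue from, or None."""
        if not self.items:
            return None
        tokens = rng.choice(list(self.items))
        cut = rng.randint(1, len(tokens))  # keep at least one token
        return tokens[:cut]


# Usage: mask a 10-token trajectory and draw a roll-in from stored failures.
rng = random.Random(0)
mask = prefix_loss_mask(10, rng)
buf = FailureBuffer(capacity=2)
buf.add(["read cell A1", "read cell B2", "wrong answer"], reward=0)
buf.add(["read cell A1", "correct answer"], reward=1)  # success, not stored
roll_in = buf.sample_roll_in(rng)
```

In a training loop under these assumptions, the policy would continue generating from `roll_in`, and only the newly generated tokens would be rewarded, so the model is trained on the correction rather than on the stored error.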