Gaze Heads: How VLMs Look at What They Describe
2026-06-12 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
AI summary
The authors find that vision-language models rely on a small set of attention heads, which they call gaze heads, whose attention tracks the image region the model is currently describing. Redirecting the attention of these heads to a chosen region makes the model describe that region instead. Intervening on only the top 100 gaze heads (fewer than 9% of all heads) steers the model's description without retraining. The behavior recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable set of heads. The authors release code, an interactive demo, and datasets.
vision-language model, attention heads, gaze heads, image region tracking, mechanistic interpretability, inference-time intervention, multimodal models, attention mask, comic strips dataset, COCO dataset
Authors
Rohit Gandikota, David Bau
Abstract
How a vision-language model internally solves the task of describing an image is far from obvious. We find that the model develops a specific mechanism for this: a small set of attention heads in its language-model backbone, which we call gaze heads, whose attention tracks the image region the model is currently describing. We find them with a simple correlation score from a few forward passes, using comic strips as a controlled testbed where narrative order is laid out spatially. These gaze heads do not just track the image tokens being described: redirecting their attention to a chosen region forces the VLM to describe that region instead. A single attention-mask intervention on the top-100 gaze heads, fewer than 9% of all heads, steers the model's answer to any chosen comic panel at 83.1% accuracy, while the same intervention on random heads fails to redirect the answer, and intervening on all heads destroys generation. The same lever also extends to continuous control: switching the gaze target mid-generation makes the model wrap up its current panel description and move to the new one within a few tokens. Beyond comics, the same intervention redirects answers to chosen regions in natural COCO images. The mechanism further recurs across model sizes from 2B to 32B parameters and across other VLM architectures, although some frozen-encoder families show no comparable head set. More broadly, this shows that targeted edits identified through mechanistic analysis can serve as practical inference-time levers for steering multimodal model behavior, without any retraining. Our code, interactive demo, and datasets are available at https://gaze.baulab.info/
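The two steps named in the abstract, scoring heads by correlation and then masking their attention, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the score is taken to be a Pearson correlation between each head's per-region attention mass and a one-hot indicator of the region being described, and the intervention is taken to block image keys outside the target region before the softmax. The function names (`gaze_scores`, `top_k_heads`, `redirect_attention`) and array layouts are hypothetical and are not the authors' code.

```python
import numpy as np


def gaze_scores(attn, region_masks, described):
    """Score every head by how well its attention follows the described region.

    attn:         (steps, layers, heads, n_img) attention from each generated
                  token to the image tokens (assumed layout).
    region_masks: (n_regions, n_img) boolean, image tokens in each region.
    described:    (steps,) index of the region being described at each step.
    Returns a (layers, heads) array of Pearson correlations (assumed score).
    """
    # Attention mass that each head places on each region at each step.
    mass = np.einsum("slhi,ri->slhr", attn, region_masks.astype(attn.dtype))
    steps, layers, heads, regions = mass.shape
    # One-hot indicator of the region described at each step.
    target = np.eye(regions)[described].reshape(steps * regions)
    x = mass.transpose(1, 2, 0, 3).reshape(layers, heads, steps * regions)
    x = x - x.mean(axis=-1, keepdims=True)
    y = target - target.mean()
    # Small epsilon keeps constant-attention heads at a score of 0.
    denom = np.linalg.norm(x, axis=-1) * np.linalg.norm(y) + 1e-12
    return (x @ y) / denom


def top_k_heads(scores, k):
    """Return the k highest-scoring (layer, head) pairs."""
    order = np.argsort(scores, axis=None)[::-1][:k]
    return [tuple(int(v) for v in np.unravel_index(i, scores.shape))
            for i in order]


def redirect_attention(logits, region_mask, is_image):
    """Attention weights for one head and one query after the mask.

    logits:      (n_keys,) pre-softmax attention logits.
    region_mask: (n_keys,) boolean, True for keys in the target region.
    is_image:    (n_keys,) boolean, True for image-token keys.
    Image keys outside the target region are blocked; text keys are untouched.
    """
    blocked = is_image & ~region_mask
    masked = np.where(blocked, -np.inf, logits)
    weights = np.exp(masked - masked.max())
    return weights / weights.sum()
```

In this sketch, `redirect_attention` would be applied only to the heads returned by `top_k_heads` (100 of them in the setting the abstract reports), leaving every other head unchanged.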