Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery

2026-05-26 | Computation and Language

AI summary

The authors studied whether vision-language models (VLMs) actually use images to better judge word meanings. They found that real images do not consistently help and often reduce agreement with human ratings, especially for words with little visual relevance. Their analyses (probing, canonical correlation analysis, and an attribution case study) suggest that images shift the models' internal representations and make them more sensitive to irrelevant visual details, so the targeted word properties become harder to recover. They also found that instructing the models to focus only on the text at inference time reduces these problems, most clearly for the affected words. Overall, the authors suggest that current instruction-tuned VLMs need to better calibrate when images should inform word-level judgments.

vision-language models, multimodal models, lexical judgments, concreteness ratings, imagery ratings, canonical correlation analysis, instruction tuning, representational shifts, spurious visual cues
Authors
Yifan Jiang, Ruoxi Ning, Sheng Yao, Freda Shi
Abstract
Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract and low-imagery words to concrete and high-imagery words. We find that real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply when visual evidence is least relevant. Through probing and canonical correlation analysis, complemented by an attribution case study, we find that real-image contexts are associated with representational shifts and greater sensitivity to spurious visual cues, coinciding with weaker recoverability of the targeted lexical properties. We further show that instructing models to focus solely on textual content at inference time can reduce this degradation, with the clearest gains on these vulnerable subsets. Our findings suggest that current instruction-tuned VLMs need better calibration of when visual context should inform lexical judgments.
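As a rough illustration of what "alignment with human ratings" could mean in practice, the sketch below compares model ratings against human ratings under a text-only and an image-context condition. The abstract does not name the alignment metric, so Spearman's rank correlation is an assumption here, and all ratings, variable names, and function names are hypothetical toy values chosen for illustration, not data or code from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical concreteness ratings for five words (toy values, not paper data).
human = [1.5, 2.0, 4.8, 4.9, 3.0]
text_only = [1.8, 2.2, 4.5, 4.7, 3.1]   # assumed model ratings without an image
with_image = [3.9, 2.2, 4.5, 4.7, 3.1]  # assumed ratings when an image is attached

rho_text = spearman(human, text_only)
rho_image = spearman(human, with_image)
print(f"text-only rho = {rho_text:.2f}, image-context rho = {rho_image:.2f}")
```

In this toy setup only the most abstract word's rating moves when an image is attached, which is enough to lower the rank correlation. Computing the same comparison separately on low- and high-imagery subsets would be one way to check where any degradation concentrates.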