VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

2026-05-27 • Computation and Language

AI summary

The authors studied whether training language models with both text and images helps them understand language more like humans do during reading. They compared pairs of models trained only on text with those trained on both text and images, using brain scans and eye movement data from people reading naturally. Their results show that adding visual training doesn't always make models more human-like overall, but it can help when sentences have strong visual meaning. This suggests that knowing language well is still the main factor for matching human reading patterns, with visual training helping only in specific cases.

Large Language Models, Vision-Language Models, Multimodal Pretraining, Natural Reading, fMRI, Eye-tracking, Saccades, Human Language Processing, Model-Human Alignment, Visual Semantic Content

Authors

Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu

Abstract

Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.
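The abstract does not state how model-human alignment is scored. As a rough illustration only, the sketch below assumes a common choice: a ridge-regression encoding model that maps sentence-level model features to measured responses and is scored by held-out Pearson correlation. The function names, the ridge penalty, the train/test split, and all data are hypothetical; the arrays are synthetic stand-ins for text-only features from a matched LLM/VLM pair, and they say nothing about the paper's actual results.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def heldout_alignment(features, responses, n_train, alpha=1.0):
    """Mean Pearson r between predicted and observed held-out responses.

    features:  (n_sentences, n_dims) model representations (assumed input)
    responses: (n_sentences, n_targets) e.g. voxel or saccade measures
    """
    X_train, X_test = features[:n_train], features[n_train:]
    Y_train, Y_test = responses[:n_train], responses[n_train:]

    # Standardize features using training statistics only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-8
    X_train = (X_train - mu) / sd
    X_test = (X_test - mu) / sd

    # Fit on centered responses, then add the training mean back.
    y_mu = Y_train.mean(axis=0)
    W = fit_ridge(X_train, Y_train - y_mu, alpha)
    pred = X_test @ W + y_mu

    # Pearson correlation per response column, averaged across columns.
    pc = pred - pred.mean(axis=0)
    yc = Y_test - Y_test.mean(axis=0)
    denom = np.sqrt((pc ** 2).sum(axis=0) * (yc ** 2).sum(axis=0)) + 1e-12
    return float(((pc * yc).sum(axis=0) / denom).mean())


# Synthetic stand-ins (NOT real data): both feature sets are equally noisy
# views of one shared latent, so neither is favored by construction.
rng = np.random.default_rng(0)
n_sent, dim, n_targets = 200, 16, 5
latent = rng.normal(size=(n_sent, dim))
llm_feats = latent + 0.5 * rng.normal(size=(n_sent, dim))
vlm_feats = latent + 0.5 * rng.normal(size=(n_sent, dim))
responses = latent @ rng.normal(size=(dim, n_targets))
responses = responses + rng.normal(size=(n_sent, n_targets))

llm_score = heldout_alignment(llm_feats, responses, n_train=150)
vlm_score = heldout_alignment(vlm_feats, responses, n_train=150)
print(f"LLM alignment: {llm_score:.3f}")
print(f"VLM alignment: {vlm_score:.3f}")
print(f"VLM minus LLM: {vlm_score - llm_score:+.3f}")
```

Under this kind of setup, the selective effect described in the abstract would correspond to computing the VLM-minus-LLM difference separately for sentence subsets with higher and lower visual semantic content, rather than once over all sentences.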
