Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
2026-05-01 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors find that large vision-language models pay progressively less attention to the image the longer they generate text, a problem they call "Visual Signal Dilution." To fix this, they introduce Persistent Visual Memory (PVM), a small add-on module that gives the model consistent access to visual information no matter how long the generated text grows. PVM works alongside the model's existing feed-forward layers to provide direct access to visual details, improving performance especially on tasks that require complex reasoning over images. Their experiments show that PVM boosts accuracy with negligible added parameters and helps the model's internal predictions converge on an answer earlier.
Large Vision-Language Models • Visual Signal Dilution • Persistent Visual Memory • Feed-Forward Network • Visual Embeddings • Attention Mechanism • Multimodal Tasks • Sequence Length • Qwen3-VL • Reasoning Tasks
Authors
Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng
Abstract
While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
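The dilution mechanism the abstract describes can be made concrete with a minimal sketch of softmax attention mass. The notation below is ours, not the paper's, and it assumes comparable logit scales for text and visual tokens.

```latex
% Attention from the step-t generation query q_t to a visual token i,
% with fixed visual token set V and growing text history T_t:
\alpha_i^{(t)} = \frac{\exp(q_t^\top k_i)}
  {\sum_{j \in \mathcal{V}} \exp(q_t^\top k_j)
   + \sum_{j \in \mathcal{T}_t} \exp(q_t^\top k_j)}
% As generation proceeds, |T_t| grows while |V| stays fixed, so the
% partition function (denominator) expands and the total visual attention
% mass shrinks roughly as |V| / (|V| + |T_t|), i.e., inversely with
% generated sequence length -- the decay the abstract calls dilution.
```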
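To illustrate what "a parallel branch alongside the FFN" with "a distance-agnostic retrieval pathway" could look like, here is a minimal PyTorch sketch. All names and design details (PersistentVisualMemory, BlockWithPVM, the learned gate, head count) are our illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a PVM-style parallel branch; not the paper's code.
import torch
import torch.nn as nn


class PersistentVisualMemory(nn.Module):
    """Lightweight cross-attention branch that retrieves cached visual
    embeddings on demand, independent of how far generation has progressed."""

    def __init__(self, d_model: int, d_visual: int, n_heads: int = 4):
        super().__init__()
        self.query = nn.Linear(d_model, d_model, bias=False)
        self.key = nn.Linear(d_visual, d_model, bias=False)
        self.value = nn.Linear(d_visual, d_model, bias=False)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate initialized to zero so the branch starts as a no-op (assumption).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, visual_embeds: torch.Tensor):
        # hidden: (B, T, d_model); visual_embeds: (B, Nv, d_visual), cached once.
        q = self.query(hidden)
        k = self.key(visual_embeds)
        v = self.value(visual_embeds)
        # No positional encoding on the visual keys, so retrieval strength does
        # not decay with the distance between image tokens and the current step.
        retrieved, _ = self.attn(q, k, v, need_weights=False)
        return torch.tanh(self.gate) * retrieved


class BlockWithPVM(nn.Module):
    """Transformer sub-block where the PVM branch runs in parallel with the FFN."""

    def __init__(self, d_model: int, d_visual: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.pvm = PersistentVisualMemory(d_model, d_visual)

    def forward(self, hidden: torch.Tensor, visual_embeds: torch.Tensor):
        h = self.norm(hidden)
        # Parallel branches: standard FFN plus direct visual retrieval.
        return hidden + self.ffn(h) + self.pvm(h, visual_embeds)


# Usage with toy shapes: a long generated sequence still sees the image directly.
block = BlockWithPVM(d_model=512, d_visual=768, d_ff=2048)
hidden = torch.randn(2, 100, 512)   # hidden states after 100 generated tokens
visual = torch.randn(2, 64, 768)    # cached image embeddings
out = block(hidden, visual)         # (2, 100, 512)
```

In this sketch the gate starts at zero so the added branch initially leaves the pretrained model's behavior untouched; whether the paper uses such gating, or how it parameterizes the retrieval pathway, is not stated in the abstract.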