Towards Robustness against Typographic Attack with Training-free Concept Localization
2026-07-02 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Computation and Language
AI summary
The authors studied a problem where vision models, like those in many AI systems, get confused by text inside images and focus on the words instead of what's actually in the picture. They created a new way to understand which parts of these models cause this confusion without needing to retrain them. By pinpointing and tweaking these parts directly, they improved the models' ability to ignore distracting text and better recognize objects. Their method made several advanced AI models more accurate when tested with tricky images containing irrelevant text.
CLIP • Vision Transformer (ViT) • Contrastive Language-Image Pretraining • Typographic Attack • mechanistic interpretability • attention heads • Visual Question Answering • object classification • latent representations
Authors
Bohan Liu, Wenqian Ye, Guangzhi Xiong, Zhenghao He, Sanchit Sinha, Aidong Zhang
Abstract
Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR.
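As a rough illustration of what a training-free, head-level intervention can look like, the sketch below rescales the output of chosen attention heads inside a single multi-head self-attention layer. It is a minimal NumPy example under stated assumptions and is not the authors' implementation: the function `multi_head_attention`, the `head_scale` parameter, the bias-free projections, and the choice of which head to damp are all hypothetical. Scaling head outputs is used here as one simple form of intervention; the abstract itself mentions selective adjustment of attention weights, whose exact form is not specified in this text.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads, head_scale=None):
    """Bias-free multi-head self-attention with optional per-head scaling.

    x:          (tokens, dim) input token representations
    w_q/k/v/o:  (dim, dim) projection matrices
    head_scale: optional (num_heads,) multipliers applied to each head's
                output before the output projection (hypothetical knob;
                1.0 keeps a head unchanged, 0.0 removes its contribution)
    """
    tokens, dim = x.shape
    head_dim = dim // num_heads

    def split(t):
        # (tokens, dim) -> (num_heads, tokens, head_dim)
        return t.reshape(tokens, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim), axis=-1)
    per_head = attn @ v  # (num_heads, tokens, head_dim)

    if head_scale is not None:
        # Intervention: damp or remove the selected heads, no retraining.
        scale = np.asarray(head_scale, dtype=per_head.dtype)
        per_head = per_head * scale[:, None, None]

    # Concatenate heads and apply the output projection.
    merged = per_head.transpose(1, 0, 2).reshape(tokens, dim)
    return merged @ w_o


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, dim, heads = 5, 8, 4
    x = rng.normal(size=(tokens, dim))
    w_q, w_k, w_v, w_o = (rng.normal(size=(dim, dim)) for _ in range(4))

    # Hypothetical choice: suppose head 1 was flagged as lexically focused.
    scale = np.ones(heads)
    scale[1] = 0.0
    damped = multi_head_attention(x, w_q, w_k, w_v, w_o, heads, head_scale=scale)
    print(damped.shape)
```

Because the projections here carry no bias, the layer output is a sum of per-head contributions, so zeroing one entry of `head_scale` removes exactly that head's contribution and leaves the others untouched. In a real ViT the same idea would typically be applied through a forward hook on the relevant attention module; which heads to target would come from an attribution analysis such as the one the abstract describes.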