P\textsuperscript{2}-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization
2026-06-02 • Computer Vision and Pattern Recognition
Computer Vision and Pattern RecognitionArtificial IntelligenceComputation and LanguageMachine Learning
AI summaryⓘ
The authors focus on reducing hallucinations in Large Vision-Language Models by improving how these models learn from visual information. They introduce a new method called P²-DPO, where the model creates and learns from its own feedback rather than relying on external human corrections, which can miss visual details or be off-target. This method helps the model pay better attention to important image regions and stay reliable even when images are degraded. Tests show that their approach works better than previous methods without needing more human data. It also improves the model's focus and robustness when understanding images.
Large Vision-Language ModelsHallucinationDirect Preference OptimizationPerceptual BottleneckVisual RobustnessOn-policy LearningPreference PairsCalibration LossAttention Region FidelityImage Degradation
Authors
Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen, Tong Zhang
Abstract
Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P\textsuperscript{2}-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference pairs construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P\textsuperscript{2}-DPO outperforms strong baselines that rely on costly human feedback on benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P\textsuperscript{2}-DPO in addressing perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.