SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring

2026-04-28

Computer Vision and Pattern Recognition · Artificial Intelligence
AI summary

The authors address how multimodal large language models (MLLMs) can better decide when to answer questions about images, especially in tricky real-world cases they haven't seen before. They create a system called SIEVES that helps the model check if its visual explanation for an answer is reliable, so it only responds when confident. This approach allows the model to answer more questions correctly without guessing too much, even for data and models it wasn't specifically trained on. The method works well across various challenging test sets and different reasoning models, improving prediction coverage while keeping errors low.

Multimodal Large Language Models (MLLMs), Visual Question Answering (VQA), Out-of-Distribution (OOD), Selective Prediction, Confidence Scoring, Visual Evidence Localization, Coverage, Reasoner Models, SIEVES, Transfer Learning
Authors
Hector G. Rodriguez, Marcus Rohrbach
Abstract
Multimodal large language models (MLLMs) achieve ever-stronger performance on vision-language tasks. Yet even as traditional visual question answering benchmarks approach saturation, reliable deployment requires satisfying low error tolerances in real-world out-of-distribution (OOD) scenarios. Specifically, selective prediction aims to improve coverage, i.e., the share of inputs the system answers, while adhering to a user-defined risk level. This is typically achieved by assigning a confidence score to each answer and abstaining on those that fall below a threshold. To enable reliable generalization, we require reasoner models to produce localized visual evidence while answering, and we design a selector that explicitly learns to estimate the quality of the localization the reasoner provides. We show that SIEVES (Selective Prediction through Visual Evidence Scoring) improves coverage by up to three times on challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA) compared to non-grounding baselines. Beyond better generalization to OOD tasks, the design of the SIEVES selector enables transfer to proprietary reasoners without access to their weights or logits, such as o3 and Gemini-3-Pro, providing coverage boosts beyond those attributable to accuracy alone. We highlight that SIEVES generalizes across all five tested OOD datasets and reasoner models (Pixel-Reasoner, o3, and Gemini-3-Pro) without benchmark- or reasoner-specific training or adaptation.
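The selective-prediction setup the abstract describes can be sketched in a few lines: score each answer, then answer only inputs whose score clears a threshold calibrated so that the empirical error rate among answered inputs stays within a user-defined risk level. This is a generic illustration of the framework, not the paper's selector; the function names and the NumPy-based calibration on a held-out labeled set are assumptions for the sketch.

```python
import numpy as np

def calibrate_threshold(scores, correct, target_risk):
    """On a held-out calibration set, find the lowest confidence threshold
    whose empirical selective risk (error rate among answered inputs)
    stays within target_risk, maximizing coverage."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]              # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    errors = np.cumsum(1.0 - correct)             # errors if we answer top-k
    ks = np.arange(1, len(correct) + 1)
    risk = errors / ks                            # selective risk at each cutoff
    feasible = np.where(risk <= target_risk)[0]
    if len(feasible) == 0:
        return np.inf                             # abstain on everything
    k = feasible[-1] + 1                          # largest coverage meeting the risk
    return np.sort(scores)[::-1][k - 1]

def select(scores, threshold):
    """Boolean mask: answer only inputs whose confidence clears the threshold."""
    return np.asarray(scores, dtype=float) >= threshold
```

Coverage is then simply the fraction of the mask that is `True`; tightening `target_risk` lowers coverage, which is the trade-off SIEVES aims to improve by scoring the reasoner's visual evidence rather than relying on logits.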