A Unified Framework for Efficient Remote Sensing Visual Question Answering: Adapting Dual, Hybrid, and Encoder-Decoder Architectures

2026-06-17 • Computer Vision and Pattern Recognition


AI summary

The authors studied how to make answering questions about aerial images easier and more efficient. They looked at three different kinds of vision-language models and used a special method that only changes a small part of the model to save computing power. Their approach added small adapters to frozen parts of the models, allowing quick learning with minimal updates. Tests showed that one model type, Hybrid FLAVA, worked best for understanding and retrieving information from high-resolution remote sensing images. This work helps set a more efficient standard for using AI in areas like disaster assessment and city monitoring.

Visual Question Answering (VQA), Remote Sensing (RS), Foundation Models, Parameter Efficient Fine Tuning (PEFT), Vision Language Models (VLM), CLIP, BLIP, FLAVA, Adapters, Multimodal Reasoning

Authors

Timothy Agboada, Shikha Chandel, Yadav Raj Ghimire, Leila Hashemi-Beni

Abstract

Visual Question Answering (VQA) in the Remote Sensing (RS) domain presents unique challenges due to the high resolution, multi-scale object distribution, and semantic complexity of aerial imagery. While general-domain Foundation Models have achieved remarkable success, their direct application to RSVQA is hindered by large domain shifts and the prohibitive computational cost of full fine-tuning. This study presents a comparative analysis of RS Adapter, a Parameter-Efficient Fine-Tuning (PEFT) strategy, applied across three distinct Vision Language Model (VLM) architectures: the Dual Encoder CLIP, the Encoder Decoder BLIP, and the Hybrid FLAVA. We introduce a unified architectural surgery pipeline that injects lightweight bottleneck adapters into the attention and MLP layers of frozen backbones, enabling rapid adaptation while training less than 5 percent of the parameters. Experimental results on the high-resolution RSVQA x dataset demonstrate that, while all adapted models converge, the Hybrid FLAVA architecture offers a superior balance of multimodal reasoning and retrieval capabilities compared to its unimodal counterparts. Our findings establish a new baseline for resource-efficient VQA in disaster assessment and urban monitoring.
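
To make the "less than 5 percent" budget concrete, the following is a rough parameter count, not the authors' implementation. It assumes a generic transformer backbone (hidden width 768, 12 layers, MLP expansion ratio 4), a hypothetical bottleneck width of 64, and one down-project/up-project adapter after the attention block and one after the MLP block in each layer. None of these numbers come from the abstract, and embeddings and task heads are ignored.

```python
# Back-of-the-envelope parameter budget for bottleneck adapters.
# All sizes below are illustrative assumptions, not values from the paper.

def transformer_layer_params(d, mlp_ratio=4):
    """Weights and biases in one standard transformer layer of width d."""
    attn = 4 * (d * d + d)                # Q, K, V and output projections
    hidden = mlp_ratio * d
    mlp = (d * hidden + hidden) + (hidden * d + d)  # two linear layers
    norms = 2 * (2 * d)                   # two LayerNorms (scale + shift)
    return attn + mlp + norms

def adapter_params(d, r):
    """Down-projection d -> r plus up-projection r -> d, both with bias."""
    return (d * r + r) + (r * d + d)

def trainable_fraction(d, layers, r, adapters_per_layer=2):
    """Share of parameters that train when the backbone stays frozen."""
    frozen = layers * transformer_layer_params(d)
    trainable = layers * adapters_per_layer * adapter_params(d, r)
    return trainable / (frozen + trainable)

if __name__ == "__main__":
    frac = trainable_fraction(d=768, layers=12, r=64)
    print(f"trainable share: {frac:.2%}")
```

Under these assumptions the trainable share comes out at roughly 2.7 percent, and it grows about linearly with the bottleneck width, so that width is the main knob for staying inside a fixed budget.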
