Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

2026-04-07 • Computer Vision and Pattern Recognition

Computer Vision and Pattern RecognitionArtificial Intelligence

AI summaryⓘ

The authors created a simple method to help AI models, originally trained on regular color images, understand thermal images taken by drones. They tested this method on a new thermal image dataset and found that one model, Qwen3-VL-8B-Instruct, was very good at recognizing animals like deer, rhinos, and elephants. By using both thermal and color images together, the models could also describe the environment and spot signs of human activity. This work shows it is possible to adapt existing AI to new types of images without major retraining, which can help with monitoring nature.

Visual Language Models (VLMs)Thermal infrared imageryMultimodal adaptationDrone imageryFine-tuningSpecies recognitionInstance enumerationOpen-set promptingHabitat-context interpretationMultimodal projector alignment

Authors

Hao Chen, Fang Qiu, Fangchao Dong, Defei Yang, Eve Bohnett, Li An

Abstract

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.

View PDFOpen arXiv