ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning

2026-06-01

Computer Vision and Pattern Recognition, Machine Learning
AI summary

The authors study how multimodal large language models (MLLMs) can keep learning new vision-language tasks over time without forgetting earlier ones, a setting called Multimodal Continual Instruction Tuning (MCIT). They point out that current methods, which route tasks to different parts of the model using image-text similarity alone, can send tasks with similar visual-linguistic content but different answer formats to the same expert, where their updates interfere. To fix this, they propose ProtoAda, which accounts for both task semantics and answer format when assigning tasks and updating the model. Their experiments show that ProtoAda improves performance, especially on tasks whose answer formats are easily corrupted by sequential tuning.

Multimodal Large Language Models, Instruction Tuning, Continual Learning, Vision-Language Tasks, Mixture of LoRA Experts, Image-Text Similarity, Task Routing, Prototype-guided Tuning, Gradient Interference, Answer Format
Authors
Yu-Cheng Shi, Zhen-Hao Xie, Jun-Tao Tang, Da-Wei Zhou
Abstract
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often employ sparse architectures such as Mixture of LoRA Experts with image-text similarity routing. However, tasks with distinct response structures can share highly similar visual-linguistic semantics and thus be wrongly routed to the same expert, so image-text similarity alone is insufficient for reliable task assignment. For example, an expert for a grounding task that requires coordinate prediction may become biased toward producing short textual answers after learning semantically similar VQA tasks. This format-blind task assignment integrates heterogeneous response types into shared parameters, inducing gradient interference and ineffective expert collaboration. To address this problem, we propose ProtoAda, a prototype-guided adaptive tuning framework. ProtoAda introduces format-aware task prototypes to align task assignment and routing with both task semantics and output structure, and further consolidates format-compatible updates in a geometry-aware manner to effectively reuse and progressively refine existing parameters. Extensive experiments on multiple benchmarks demonstrate that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily corrupted by sequential tuning.
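
The routing failure described in the abstract can be illustrated with a toy example. The sketch below is not ProtoAda's actual algorithm, which the abstract does not specify. It only assumes that each expert stores a prototype with a semantic part and an answer-format part, and that the router picks the expert with the highest cosine similarity. All names (`route`, `fmt_weight`, the expert labels) and all vectors are hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def route(query_sem, query_fmt, prototypes, fmt_weight=0.0):
    """Return the name of the best-matching expert.

    fmt_weight = 0 mimics format-blind (semantic-only) routing;
    fmt_weight > 0 adds a format-similarity term (an illustrative
    assumption, not the paper's scoring rule).
    """
    scores = {
        name: cosine(query_sem, p["sem"]) + fmt_weight * cosine(query_fmt, p["fmt"])
        for name, p in prototypes.items()
    }
    return max(scores, key=scores.get)


# Hypothetical prototypes: the two experts have nearly identical semantics
# but different answer formats (short text vs. coordinates).
prototypes = {
    "vqa_short_text": {"sem": np.array([1.0, 0.1]), "fmt": np.array([1.0, 0.0])},
    "grounding_coords": {"sem": np.array([0.9, 0.3]), "fmt": np.array([0.0, 1.0])},
}

# A grounding query whose semantics happen to sit closer to the VQA prototype.
query_sem = np.array([1.0, 0.15])
query_fmt = np.array([0.0, 1.0])  # the expected answer is a set of coordinates

format_blind = route(query_sem, query_fmt, prototypes, fmt_weight=0.0)
format_aware = route(query_sem, query_fmt, prototypes, fmt_weight=1.0)
print(format_blind)
print(format_aware)
```

With these made-up vectors, the semantic-only router sends the grounding query to the VQA expert, while adding the format term sends it to the grounding expert. This is the kind of misassignment that format-aware prototypes are meant to prevent.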