Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM
2026-06-03 • Computation and Language
AI summary
The authors address a problem with Transformers, a type of AI model whose compute cost grows quadratically with input length, which makes long inputs slow and expensive to process. They build on Mamba, a state-space model that handles long inputs more efficiently. To improve Mamba on tasks that combine images and language, they introduce a projector that uses cross-attention to compress the image tokens before they are fed into the model. The projector also removes the need to hand-design the order in which 2D image features are scanned into a sequence. Experiments on several vision-language understanding benchmarks show that this approach improves both performance and throughput.
Transformer, Large Language Models (LLMs), Computational complexity, Mamba, Structured State-Space Model, Cross-attention, Vision-language modeling, Token compression, Multimodal learning
Authors
SooHwan Eom, Jay Shim, Gwanhyeong Koo, Haebin Na, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo
Abstract
The Transformer's quadratic complexity in input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this computational challenge effectively. This paper explores a query-based cross-modal projector designed to bolster Mamba's efficiency for vision-language modeling by compressing visual tokens, conditioned on the input, through a cross-attention mechanism. The projector also removes the need to manually design the 2D scan order of the original image features when converting them into an input sequence for the Mamba LLM. Experimental results across various vision-language understanding benchmarks show that the proposed cross-modal projector enhances Mamba-based multimodal LLMs, boosting both performance and throughput.
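To make the idea of query-based compression concrete, the following is a minimal sketch of a single-head cross-attention projector, not the authors' implementation. It assumes a small set of query vectors that attend over all visual tokens, so the output length equals the number of queries rather than the number of image patches. The function name `query_projector`, the weight names, and all sizes are hypothetical, and how the queries are derived from the input is left out because the abstract does not specify it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_projector(visual_tokens, queries, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention projector.

    visual_tokens: (num_patches, d_vis) image features
    queries:       (num_queries, d_model) query vectors
    Returns (num_queries, d_model) compressed tokens and the attention map.
    """
    q = queries @ w_q                           # (num_queries, d_model)
    k = visual_tokens @ w_k                     # (num_patches, d_model)
    v = visual_tokens @ w_v                     # (num_patches, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (num_queries, num_patches)
    attn = softmax(scores, axis=-1)             # each query's weights sum to 1
    return attn @ v, attn

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
num_patches, d_vis, d_model, num_queries = 576, 64, 32, 16
visual_tokens = rng.normal(size=(num_patches, d_vis))
queries = rng.normal(size=(num_queries, d_model))
w_q = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
w_k = rng.normal(size=(d_vis, d_model)) / np.sqrt(d_vis)
w_v = rng.normal(size=(d_vis, d_model)) / np.sqrt(d_vis)

compressed, attn = query_projector(visual_tokens, queries, w_q, w_k, w_v)
print(compressed.shape)  # 576 visual tokens compressed to 16

# Shuffling the patch order leaves the output unchanged in this sketch,
# because attention sums over patches and no positional encoding is used.
perm = rng.permutation(num_patches)
compressed_perm, _ = query_projector(visual_tokens[perm], queries, w_q, w_k, w_v)
print(np.allclose(compressed, compressed_perm))
```

In this sketch the sequence passed on to the language model has `num_queries` tokens regardless of image resolution. The permutation check illustrates one plausible reading of why no hand-designed 2D scan order is needed: without positional encodings, the output does not depend on the order in which patches are listed. Whether the paper's projector adds positional information is not stated in the abstract.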