ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning

2026-04-09 | Computer Vision and Pattern Recognition

AI summary

The authors study how to make video captioning more efficient with fully open multimodal large language models (MLLMs). They note that existing Transformer-based models struggle with long video sequences because the cost of attention grows quadratically with sequence length. To address this, they propose ABMamba, which uses Deep State Space Models as the language backbone in place of attention, giving linear computational complexity, and adds a scan module that processes video at multiple temporal resolutions. On standard benchmarks such as VATEX and MSR-VTT, the model is competitive with typical MLLMs while achieving approximately three times higher throughput. This makes it possible to caption videos without the computational cost of quadratic attention.

video captioning, multimodal large language models, Transformer, attention mechanism, Deep State Space Models, temporal dependencies, sequence length, bidirectional scan, computational complexity, benchmark datasets
Authors
Daichi Yashima, Shuhei Kurita, Yusuke Oda, Shuntaro Suzuki, Seitaro Otsuki, Komei Sugiura
Abstract
In this study, we focus on video captioning by fully open multimodal large language models (MLLMs). The comprehension of visual sequences is challenging because of their intricate temporal dependencies and substantial sequence length. The core attention mechanisms of existing Transformer-based approaches scale quadratically with the sequence length, making them computationally prohibitive. To address these limitations, we propose Aligned Hierarchical Bidirectional Scan Mamba (ABMamba), a fully open MLLM with linear computational complexity that enables the scalable processing of video sequences. ABMamba extends Deep State Space Models as its language backbone, replacing the costly quadratic attention mechanisms, and employs a novel Aligned Hierarchical Bidirectional Scan module that processes videos across multiple temporal resolutions. On standard video captioning benchmarks such as VATEX and MSR-VTT, ABMamba demonstrates competitive performance compared to typical MLLMs while achieving approximately three times higher throughput.
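The idea of a linear-cost bidirectional scan over several temporal resolutions can be illustrated with a toy sketch. This is not the authors' module: the abstract does not say how ABMamba pools, aligns, or fuses the resolutions, so the scalar recurrence, the `decay` parameter, the average pooling, the strides `(1, 2, 4)`, and the repeat-and-average fusion below are all assumptions chosen for illustration. What the sketch does show is why the cost stays linear: each scan visits every element once, and the coarser levels are shorter than the input.

```python
from typing import List, Sequence


def linear_scan(xs: Sequence[float], decay: float = 0.9) -> List[float]:
    """Toy scalar recurrence h_t = decay * h_{t-1} + x_t; one pass, O(len(xs))."""
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def bidirectional_scan(xs: Sequence[float], decay: float = 0.9) -> List[float]:
    """Sum of a forward scan and a backward scan over the same sequence."""
    fwd = linear_scan(xs, decay)
    bwd = linear_scan(xs[::-1], decay)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def average_pool(xs: Sequence[float], stride: int) -> List[float]:
    """Average non-overlapping windows of `stride` items (last window may be shorter)."""
    windows = [xs[i:i + stride] for i in range(0, len(xs), stride)]
    return [sum(w) / len(w) for w in windows]


def hierarchical_bidirectional_scan(
    xs: Sequence[float],
    strides: Sequence[int] = (1, 2, 4),  # hypothetical temporal resolutions
    decay: float = 0.9,
) -> List[float]:
    """Scan each resolution, map results back to frame indices, average the levels.

    Assumed fusion rule: frame t reads the coarse output at index t // stride.
    """
    n = len(xs)
    total = [0.0] * n
    for s in strides:
        coarse = bidirectional_scan(average_pool(xs, s), decay)
        for t in range(n):
            total[t] += coarse[t // s]
    return [v / len(strides) for v in total]


if __name__ == "__main__":
    frames = [float(i) for i in range(8)]  # stand-in for per-frame features
    print(hierarchical_bidirectional_scan(frames))
```

With strides 1, 2 and 4, the three levels scan n, about n/2 and about n/4 elements, so the total work stays proportional to n, whereas self-attention over the same n frames compares every pair.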