EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction
2026-05-08 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors introduce EmambaIR, a new method for reconstructing images from event-camera data that overcomes key limitations of existing approaches. They design two components: a cross-modal Top-k Sparse Attention Module that focuses computation on the most relevant pixels when fusing event and image features, and a Gated State-Space Module that improves how temporal information is modeled without heavy computation. Their method outperforms current state-of-the-art techniques on motion deblurring, deraining, and HDR enhancement, while using less memory and compute. They evaluate on six datasets and release their code publicly.
Event-based imaging · Convolutional Neural Networks (CNNs) · Vision Transformers (ViTs) · Sparse attention · State Space Models (SSMs) · Image reconstruction · Motion deblurring · High Dynamic Range (HDR) · Temporal representation · Cross-modal fusion
Authors
Wei Yu, Yunhang Qian
Abstract
Recent event-based image reconstruction methods predominantly rely on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to process complementary event information. However, these architectures face fundamental limitations: CNNs often fail to capture global feature correlations, whereas ViTs incur quadratic computational complexity ($O(n^2)$), hindering their application in high-resolution scenarios. To address these bottlenecks, we introduce EmambaIR, an Efficient visual State Space Model designed for image reconstruction using spatially sparse and temporally continuous event streams. Our framework introduces two key components: the cross-modal Top-k Sparse Attention Module (TSAM) and the Gated State-Space Module (GSSM). TSAM performs pixel-level top-k sparse attention to guide cross-modal interactions, yielding rich yet sparse fusion features. Subsequently, GSSM utilizes a nonlinear gated unit to enhance the temporal representation of vanilla linear-complexity ($O(n)$) SSMs, effectively capturing global contextual dependencies without the typical computational overhead. Extensive experiments on six datasets across three diverse image reconstruction tasks - motion deblurring, deraining, and High Dynamic Range (HDR) enhancement - demonstrate that EmambaIR significantly outperforms state-of-the-art methods while offering substantial reductions in memory consumption and computational cost. The source code and data are publicly available at: https://github.com/YunhangWickert/EmambaIR
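To make the two mechanisms in the abstract concrete, the sketch below illustrates the general ideas behind them: top-k sparse attention (keep only the k highest-scoring keys per query before the softmax, as in TSAM's pixel-level sparsification) and a sigmoid-gated linear state-space recurrence with $O(n)$ cost in sequence length (in the spirit of GSSM). This is a minimal NumPy illustration of the generic techniques, not the authors' implementation; all function names, shapes, and the placement of the gate are assumptions for exposition.

```python
import numpy as np


def topk_sparse_attention(q, k, v, topk=2):
    """Attention where each query attends only to its top-k keys.

    q, k, v: arrays of shape (n, d). Illustrative stand-in for TSAM's
    pixel-level top-k sparse attention (names/shapes are assumptions).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])                   # (n, n) dense scores
    idx = np.argpartition(scores, -topk, axis=-1)[:, -topk:]  # top-k key indices per query
    masked = np.full_like(scores, -np.inf)                    # drop all but top-k
    np.put_along_axis(masked, idx,
                      np.take_along_axis(scores, idx, axis=-1), axis=-1)
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))   # softmax over kept keys
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                              # (n, d)


def gated_ssm_scan(x, A, B, C, Wg):
    """Linear-time SSM recurrence with a nonlinear input gate.

    x: (T, d_in); A: (d_h, d_h); B: (d_h, d_in); C: (d_out, d_h); Wg: (d_in, d_in).
    Computes h_t = A h_{t-1} + B (g_t * x_t), y_t = C h_t, with a sigmoid
    gate g_t. A hypothetical stand-in for GSSM's gated unit.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for xt in x:                                  # single O(T) pass over the sequence
        g = 1.0 / (1.0 + np.exp(-(Wg @ xt)))      # gate values in (0, 1)
        h = A @ h + B @ (g * xt)                  # linear state update on gated input
        ys.append(C @ h)
    return np.stack(ys)                           # (T, d_out)
```

Each query row ends up with exactly `topk` nonzero attention weights, so the mixing is sparse while the scan side stays linear in sequence length, which is the complexity contrast the abstract draws against quadratic ViT attention.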