Beyond Mamba: Enhancing State-space Models with Deformable Dilated Convolutions for Multi-scale Traffic Object Detection

2026-04-09

Computer Vision and Pattern Recognition
AI summary

The authors observe that detecting traffic objects of different sizes is difficult because small objects carry fine local details that are hard to capture together with larger, global features. They note that current Mamba-based models struggle to capture these details and to combine information across scales. To address this, they propose MDDCNet, a network that combines deformable dilated convolutions with Mamba blocks, a channel-enhanced feed-forward network, and an attention-based feature pyramid to represent both fine details and overall context. Their experiments indicate that this method outperforms other advanced detectors on public benchmark and real-world traffic datasets.

object detection, traffic scenario, long-range dependencies, deformable dilated convolutions, hierarchical feature representation, feed-forward network, attention mechanism, feature pyramid network, multi-scale feature fusion
Authors
Jun Li, Yingying Shi, Zhixuan Ruan, Nan Guo, Jianhua Xu
Abstract
In a real-world traffic scenario, varying-scale objects are usually distributed in a cluttered background, which poses great challenges to accurate detection. Although current Mamba-based methods can efficiently model long-range dependencies, they still struggle to capture small objects with abundant local details, which hinders joint modeling of local structures and global semantics. Moreover, state-space models exhibit limited hierarchical feature representation and weak cross-scale interaction due to flat sequential modeling and insufficient spatial inductive biases, leading to sub-optimal performance in complex scenes. To address these issues, we propose a Mamba with Deformable Dilated Convolutions Network (MDDCNet) for accurate traffic object detection in this study. In MDDCNet, a well-designed hybrid backbone with successive Multi-Scale Deformable Dilated Convolution (MSDDC) blocks and Mamba blocks enables hierarchical feature representation from local details to global semantics. Meanwhile, a Channel-Enhanced Feed-Forward Network (CE-FFN) is further devised to overcome the limited channel interaction capability of conventional feed-forward networks, whilst a Mamba-based Attention-Aggregating Feature Pyramid Network (A^2FPN) is constructed to achieve enhanced multi-scale feature fusion and interaction. Extensive experimental results on public benchmark and real-world datasets demonstrate the superiority of our method over various advanced detectors. The code is available at https://github.com/Bettermea/MDDCNet.
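The abstract does not describe the internals of the MSDDC block, so the following is only a rough illustration of the general idea behind multi-scale dilated convolutions, not the authors' implementation. It is a pure-Python, one-dimensional sketch that omits the deformable offsets entirely. The function names and the dilation rates (1, 2, 3) are assumptions chosen for illustration. It shows how the same small kernel covers a wider span as the dilation rate grows, which is one common way to gather context at several scales.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid (no padding) 1-D dilated cross-correlation in pure Python."""
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    # Slide the dilated kernel over every position where it fully fits.
    for start in range(len(signal) - span + 1):
        out.append(
            sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        )
    return out


def multi_scale_response(signal, kernel, dilations=(1, 2, 3)):
    """Apply one kernel at several dilation rates (rates are illustrative)."""
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5, 6, 7]
    kernel = [1, 1, 1]
    for d, response in multi_scale_response(signal, kernel).items():
        print(d, effective_kernel_size(len(kernel), d), response)
```

With a kernel of size 3, dilation rates 1, 2, and 3 cover spans of 3, 5, and 7 samples respectively, so the per-rate outputs summarize progressively wider neighborhoods without adding kernel weights.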