Mixture-of-Depths Attention
2026-03-16 • Computation and Language • Artificial Intelligence
AI summary
The authors address a problem in very deep large language models where important information gets weaker as it passes through layers. They propose a new method called mixture-of-depths attention (MoDA), which lets the model look at information from both the current and previous layers to keep useful details strong. They also create an efficient way to do this on hardware without slowing things down much. Tests show MoDA helps models understand and perform better with only a small cost in computing power. The authors also find that pairing MoDA with a certain optimization technique (post-norm) works best.
large language models, depth scaling, residual connections, attention mechanism, mixture-of-depths attention, perplexity, FlashAttention, post-norm, computational overhead, sequence length
Authors
Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
Abstract
Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it reduces average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA.
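To make the abstract's core mechanism concrete, the sketch below illustrates one plausible reading of MoDA for a single head: each query attends, in one softmax, over the causal sequence KV pairs at the current layer plus that token's own KV pairs cached from preceding layers. All names, shapes, and the exact form of the "depth KV pairs" are assumptions for illustration, not the authors' implementation (which is fused for hardware efficiency in the style of FlashAttention).

```python
# Minimal single-head MoDA sketch (illustrative; not the released code).
# Assumption: "depth KV pairs" are each token's K/V vectors from earlier layers.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moda_attention(q, k_seq, v_seq, k_depth, v_depth):
    """q:               (T, d)          queries at the current layer
    k_seq, v_seq:       (T, d)          sequence KV pairs at the current layer
    k_depth, v_depth:   (T, L_prev, d)  per-token KV cached from preceding layers
    Returns:            (T, d)          attention output."""
    T, d = q.shape
    out = np.empty_like(q)
    for i in range(T):
        # Causal sequence keys/values up to position i at this layer ...
        # ... concatenated with token i's depth KV from earlier layers,
        # so one softmax mixes across both the sequence and depth axes.
        keys = np.concatenate([k_seq[: i + 1], k_depth[i]], axis=0)
        vals = np.concatenate([v_seq[: i + 1], v_depth[i]], axis=0)
        scores = keys @ q[i] / np.sqrt(d)
        out[i] = softmax(scores) @ vals
    return out
```

In this naive form the depth KV lookup is the non-contiguous memory access the abstract refers to: sequence KV is laid out per layer, while depth KV gathers across layers per token, which is what the paper's hardware-efficient algorithm resolves.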