Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference

2026-04-09, Machine Learning

Machine Learning, Artificial Intelligence, Computation and Language
AI summary

The authors study Mixture-of-Experts (MoE) models, which route each token to a small subset of expert sub-networks but suffer inference latency when many experts are activated. They introduce an "activation budget" that caps the number of expert activations, aiming to keep inference fast without hurting accuracy. They propose Alloc-MoE, which divides this budget across layers (Alloc-L, using sensitivity profiling and dynamic programming) and across tokens (Alloc-T, using routing scores). In their experiments, Alloc-MoE reaches 1.15× prefill and 1.34× decode speedups on DeepSeek-V2-Lite at half of the original activation budget.

Mixture-of-Experts (MoE), activation budget, sparse activation, layer-level allocation, token-level allocation, dynamic programming, routing scores, inference latency, model efficiency, DeepSeek-V2-Lite
Authors
Baihui Liu, Kaiyuan Tian, Wei Wang, Zhaoning Zhang, Linbo Qiao, Dongsheng Li
Abstract
Mixture-of-Experts (MoE) has become a dominant architecture for scaling large language models due to its sparse activation mechanism. However, the large number of expert activations creates a critical latency bottleneck during inference, especially in resource-constrained deployment scenarios. Existing approaches that reduce expert activations can severely degrade model performance. In this work, we introduce the concept of an \emph{activation budget}, a constraint on the number of expert activations, and propose Alloc-MoE, a unified framework that jointly optimizes budget allocation at both the layer and token levels to minimize performance degradation. At the layer level, we introduce Alloc-L, which leverages sensitivity profiling and dynamic programming to determine the optimal allocation of expert activations across layers. At the token level, we propose Alloc-T, which dynamically redistributes activations based on routing scores, optimizing budget allocation without increasing latency. Extensive experiments across multiple MoE models demonstrate that Alloc-MoE maintains model performance under a constrained activation budget. In particular, Alloc-MoE achieves $1.15\times$ prefill and $1.34\times$ decode speedups on DeepSeek-V2-Lite at half of the original budget.
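
The abstract does not spell out how the two allocation steps work, so the following is only an illustrative sketch of what budget allocation at the two levels could look like, not the authors' implementation. It assumes (hypothetically) that layer-level profiling yields a table `loss[l][k-1]`, the degradation when layer `l` activates `k` experts per token, that these losses add up across layers, and that the token-level step lets (token, expert) pairs compete for a shared pool by routing score while every token keeps its top-1 expert. The function names `allocate_layers` and `allocate_tokens` are invented for this example.

```python
from math import inf


def allocate_layers(loss, budget):
    """Multiple-choice knapsack DP (assumed formulation, not from the paper).

    loss[l][k-1]: hypothetical profiled degradation of layer l with k experts.
    Returns one k per layer, summing to at most `budget`, with minimal total loss.
    """
    n_layers = len(loss)
    # best[l][b]: minimal loss of the first l layers using exactly b activations
    best = [[inf] * (budget + 1) for _ in range(n_layers + 1)]
    choice = [[0] * (budget + 1) for _ in range(n_layers + 1)]
    best[0][0] = 0.0
    for l in range(1, n_layers + 1):
        for b in range(budget + 1):
            for k in range(1, min(len(loss[l - 1]), b) + 1):
                cand = best[l - 1][b - k] + loss[l - 1][k - 1]
                if cand < best[l][b]:
                    best[l][b] = cand
                    choice[l][b] = k
    # Use the cheapest feasible total that does not exceed the budget.
    b = min(range(budget + 1), key=lambda x: best[n_layers][x])
    if best[n_layers][b] == inf:
        raise ValueError("budget too small: every layer needs at least one expert")
    alloc = []
    for l in range(n_layers, 0, -1):  # backtrack the recorded choices
        alloc.append(choice[l][b])
        b -= choice[l][b]
    return alloc[::-1]


def allocate_tokens(scores, k_avg):
    """Share n_tokens * k_avg activations across tokens by routing score (assumed rule).

    scores[t][e]: routing score of expert e for token t.
    Returns the list of selected expert indices for each token.
    """
    if k_avg < 1:
        raise ValueError("k_avg must be at least 1")
    n_tokens = len(scores)
    ranked = [sorted(range(len(s)), key=lambda e: -s[e]) for s in scores]
    selected = [[r[0]] for r in ranked]  # every token keeps its top-1 expert
    # Remaining (token, expert) pairs compete for the rest of the pool.
    rest = [(scores[t][e], t, e) for t, r in enumerate(ranked) for e in r[1:]]
    rest.sort(key=lambda item: -item[0])
    for _, t, e in rest[: n_tokens * k_avg - n_tokens]:
        selected[t].append(e)
    return selected


# Toy numbers, made up for illustration only.
layer_loss = [[5.0, 1.0, 0.5], [2.0, 1.8, 1.7], [9.0, 3.0, 0.2]]
layer_alloc = allocate_layers(layer_loss, budget=6)

token_scores = [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25]]
token_alloc = allocate_tokens(token_scores, k_avg=2)
```

With these toy inputs, the layer-level sketch moves activations away from the least sensitive layer toward the most sensitive one instead of giving each layer two experts, and the token-level sketch gives the token with a flat score distribution more experts than the token with one dominant expert, while the total number of activations stays at the budget.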