Generalization and Scaling Laws for Mixture-of-Experts Transformers
2026-04-10 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors provide a mathematical framework for understanding how Mixture-of-Experts (MoE) Transformers generalize and scale with data and model size. They separate the complexity due to routing (which experts are chosen) from the complexity due to the active parameters applied to each input. Their theory shows that approximation and estimation error trade off as in regular dense networks once only the active parameters used per input are counted. They also show that, under their construction, MoE approximation error can decrease either by increasing active capacity or by adding more experts, depending on which bottleneck dominates. Finally, they derive scaling laws that describe the trade-offs among model size, training data, and compute.
Mixture-of-Experts, Transformers, generalization bound, routing patterns, covering number, metric entropy, manifold data model, ERM (Empirical Risk Minimization), scaling laws, approximation theorem
Authors
Mansour Zoubeirou a Mayaki
Abstract
We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates active per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^\beta$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.
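As a schematic illustration of the conditioning-and-union-bound step, the generic inequality behind such an argument can be written as follows. The notation is assumed here for illustration and is not taken from the paper: $\mathcal{F}$ denotes the MoE function class, $\Pi$ a finite set of routing patterns, and $\mathcal{F}_\pi$ the subclass with routing fixed to $\pi$, under the assumption that every function in $\mathcal{F}$ lies in some $\mathcal{F}_\pi$. Merging an $\epsilon$-cover of each $\mathcal{F}_\pi$ yields an $\epsilon$-cover of $\mathcal{F}$, so

$$
N(\epsilon, \mathcal{F}, \|\cdot\|_\infty) \le \sum_{\pi \in \Pi} N(\epsilon, \mathcal{F}_\pi, \|\cdot\|_\infty)
\quad\Longrightarrow\quad
\log N(\epsilon, \mathcal{F}, \|\cdot\|_\infty) \le \log |\Pi| + \max_{\pi \in \Pi} \log N(\epsilon, \mathcal{F}_\pi, \|\cdot\|_\infty).
$$

In this sketch the term $\log |\Pi|$ plays the role of the routing overhead and the second term that of the active-parameter metric entropy. The paper's precise bound, constants, and architectural dependence are not reproduced here.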