MobileMoE: Scaling On-Device Mixture of Experts
2026-05-26 • Machine Learning
Machine LearningArtificial IntelligenceComputation and Language
AI summaryⓘ
The authors introduce MobileMoE, a new type of language model designed to work efficiently on smartphones with less than a billion active parameters. They created a method to optimize the model’s structure so it fits mobile devices' memory and computing limits, finding the best balance of expert components. MobileMoE performs as well or better than existing compact models while using fewer calculation steps and smaller model sizes. The team also shows how to run these models quickly on regular smartphones, making mobile use more practical.
Mixture-of-Experts (MoE)language modelson-device deploymentsparsitymodel scaling lawfine-grained expertsinstruction fine-tuningquantization-aware traininginference FLOPsINT4 quantization
Authors
Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, Raghuraman Krishnamoorthi
Abstract
Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.