Scaling State-Space Models on Multiple GPUs with Tensor Parallelism

2026-02-24

Distributed, Parallel, and Cluster Computing · Machine Learning
AI summary

The authors studied how to make selective state space models (SSMs), which help large language models handle long texts, run faster on multiple GPUs. They designed a way to split the work across GPUs that keeps the important calculations close together to avoid slow communication. Testing on several SSM-based models, they showed their method can nearly double or even quadruple the speed depending on how many GPUs are used and the length of the text. They also improved efficiency by using a technique called quantized all-reduce, which reduces the communication cost. Overall, the authors focused on making SSM inference more practical and faster for large models on multi-GPU systems.
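The summary's point about keeping "the important calculations close together" rests on a property of diagonal selective SSMs: each channel's recurrent state evolves independently, so channels can be sharded across GPUs without any communication inside the recurrence. A minimal numpy sketch (illustrative only, not the paper's code; the sizes and the simple decay recurrence are assumptions):

```python
import numpy as np

# Illustrative sketch: a diagonal SSM recurrence h_t = a * h_{t-1} + b * x_t.
# Channels never interact, so sharding them across GPUs keeps the recurrent
# update communication-free. Sizes here are hypothetical.
rng = np.random.default_rng(0)
d = 8                              # channels
A = rng.uniform(0.5, 0.99, d)     # per-channel decay (diagonal A)
B = rng.standard_normal(d)        # per-channel input gain
x = rng.standard_normal((5, d))   # 5 time steps of input

def scan(a, b, xs):
    """Run the recurrence over time and return the final state."""
    h = np.zeros_like(a)
    for x_t in xs:
        h = a * h + b * x_t
    return h

# Full-model scan vs. two "GPU shards" that each scan half the channels.
h_full = scan(A, B, x)
h_shards = np.concatenate([scan(A[:4], B[:4], x[:, :4]),
                           scan(A[4:], B[4:], x[:, 4:])])
assert np.allclose(h_full, h_shards)  # shards agree: no cross-GPU sync needed
```

The assertion holds exactly because each channel's arithmetic is identical whether it runs in the full scan or in a shard.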

Selective State Space Models (SSMs) · Large Language Models (LLMs) · Tensor Parallelism (TP) · Prefill and Decode · SSM State Cache · Recurrent State Update · Quantized AllReduce · Throughput · NVIDIA A6000 and A100 · Long Context Length
Authors
Anurag Dutt, Nimit Shah, Hazem Masarani, Anshul Gandhi
Abstract
Selective state space models (SSMs) have rapidly become a compelling backbone for large language models, especially for long-context workloads. Yet in deployment, their inference performance is often bounded by the memory capacity, bandwidth, and latency limits of a single GPU, making multi-GPU execution increasingly necessary. Although tensor parallelism (TP) is widely used to scale Transformer inference, applying it to selective SSM blocks is non-trivial: the SSM mixer couples large projections with a sequence-wise recurrent state update and local mixing whose efficiency depends on preserving locality and keeping synchronization off the critical path. This paper presents a communication-efficient TP design for selective SSM inference that addresses three practical engineering challenges: enabling time-to-first-token (TTFT) improvements via an SSM state cache shared across prefill and decode, partitioning the mixer's packed parameter tensor so that recurrent updates remain local while minimizing communication, and reducing TP aggregation overhead with quantized all-reduce. We evaluate three representative SSM-based LLMs spanning pure-SSM and hybrid architectures - Mamba, Falcon-Mamba, and Zamba - on NVIDIA A6000 and A100 clusters. Our experiments show substantial throughput gains from tensor-parallel SSM inference: batch-request throughput improves by ~1.6-2.1x on 2 GPUs and ~2.6-4.0x on 4 GPUs for Mamba, with the largest benefits at long context lengths, and quantized all-reduce yields a further ~10-18% throughput improvement by lowering synchronization bandwidth overhead.
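The quantized all-reduce idea from the abstract can be sketched in a few lines: each rank quantizes its fp32 partial result to int8 plus one scale before the reduction, roughly quartering synchronization bandwidth at the cost of a small quantization error. This is a hedged illustration in plain numpy, simulating four ranks; the per-tensor symmetric int8 scheme and tensor sizes are assumptions, not the paper's exact design:

```python
import numpy as np

# Simulate a quantized all-reduce across 4 "ranks": each rank's fp32 partial
# output is quantized to int8 (plus one fp32 scale) before summation, so the
# wire payload shrinks ~4x versus sending fp32 tensors.
rng = np.random.default_rng(1)
partials = [rng.standard_normal(1024).astype(np.float32) for _ in range(4)]

def quantize(t):
    """Symmetric per-tensor int8 quantization (hypothetical scheme)."""
    scale = np.abs(t).max() / 127.0
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

# Each rank transmits (int8 tensor, fp32 scale); the reduction dequantizes
# and sums, standing in for the actual collective.
quantized = [quantize(p) for p in partials]
reduced = sum(q.astype(np.float32) * s for q, s in quantized)

exact = sum(partials)
err = np.abs(reduced - exact).max()  # small: bounded by half a quantization
                                     # step per rank
```

In a real deployment the dequantize-and-sum would happen inside the collective (e.g. a custom NCCL or fused kernel path); the sketch only shows why the bandwidth saving comes nearly for free in accuracy terms.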