You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

2026-06-04, Computation and Language

Computation and Language, Artificial Intelligence, Machine Learning
AI summary

The authors address the slowdown of large language models (LLMs) during long-context decoding, particularly when the model generates long chains of thought. They propose cross-layer sparse attention (CLSA), which shares the token-selection (routing) index across layers, so the top-k selection is computed once rather than repeated in every layer. This amortizes the routing cost while keeping the model's accuracy high. Their experiments report up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context.

Large Language Models, Long-context Inference, Sparse Attention, KV Cache, Token Selection, Decoding Efficiency, Cross-layer Sharing, Throughput, Chain of Thought
Authors
Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei
Abstract
Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.
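The shared-routing idea in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`topk_indices`, `sparse_attention`, `decode_step`), the dot-product scoring used by the indexer, and the bare residual update are all assumptions made for illustration, and real layers would add projections, multiple heads, and feed-forward blocks. The sketch only shows the control flow: one top-k selection per decoding step, reused by every layer that reads the shared KV cache.

```python
import heapq
import math
import random


def topk_indices(query, keys, k):
    """Hypothetical indexer: pick the k cached positions with the highest dot-product score."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in keys]
    return sorted(heapq.nlargest(k, range(len(keys)), key=scores.__getitem__))


def sparse_attention(query, keys, values, index):
    """Softmax attention restricted to the cached positions listed in `index`."""
    scale = math.sqrt(len(query))
    logits = [sum(q * x for q, x in zip(query, keys[i])) / scale for i in index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [
        sum(w * values[i][j] for w, i in zip(weights, index)) / total
        for j in range(len(values[0]))
    ]


def decode_step(hidden, shared_keys, shared_values, num_layers, k):
    """One decoding step: route once, then reuse the same index in every layer."""
    index = topk_indices(hidden, shared_keys, k)  # computed a single time
    for _ in range(num_layers):
        attended = sparse_attention(hidden, shared_keys, shared_values, index)
        # Assumed toy layer: residual update only, no projections or MLP.
        hidden = [h + a for h, a in zip(hidden, attended)]
    return hidden, index


if __name__ == "__main__":
    random.seed(0)
    dim, cache_len = 4, 16
    keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(cache_len)]
    values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(cache_len)]
    query = [random.gauss(0, 1) for _ in range(dim)]
    out, idx = decode_step(query, keys, values, num_layers=8, k=4)
    print("shared index:", idx)
    print("output:", out)
```

In this toy setting, a per-layer token sparse baseline would call `topk_indices` inside the loop, scanning the full cache `num_layers` times per step. Sharing the index reduces that to a single scan, which is the amortization the abstract describes.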