AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
2026-04-09 • Computation and Language
AI summary
The authors address the problem of running large language models (LLMs) on very long input contexts, where attention becomes slow and the key-value (KV) cache consumes too much memory. They propose AsyncTLS, a two-level method that first filters blocks of tokens coarsely, then selects tokens more precisely, balancing speed and accuracy. They also build an offloading engine that transfers KV cache data in the background while the model computes. In their evaluation, the method keeps accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements.
Large Language Models, Attention Mechanism, Sparse Attention, Key-Value Cache, Token-level Attention, Block-level Attention, Asynchronous Offloading, Long Context Inference, Throughput, Hierarchical Attention
Authors
Yuxuan Hu, Jianchao Tan, Jiaqi Zhang, Wen Zan, Pingwei Sun, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai, Jing Zhang
Abstract
Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods improve efficiency but sacrifice precision. We propose AsyncTLS, a hierarchical sparse attention system that combines coarse-grained block filtering with fine-grained token selection to balance accuracy and efficiency, coupled with an asynchronous offloading engine that overlaps KV cache transfers with computation by exploiting temporal locality. Evaluated on Qwen3 and GLM-4.7-Flash across GQA and MLA architectures, AsyncTLS achieves accuracy comparable to full attention while delivering 1.2x - 10.0x operator speedups and 1.3x - 4.7x end-to-end throughput improvements on 48k - 96k contexts.
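To make the two-level idea concrete, here is a minimal single-query sketch in plain Python. It is an illustration under stated assumptions, not the AsyncTLS implementation: the abstract does not specify how blocks are scored, so this sketch assumes each block is summarized by its mean key, and it assumes a simple top-k rule at both levels. The names `two_level_sparse_attention`, `top_blocks`, and `top_tokens` are hypothetical, and the asynchronous offloading engine is not modeled.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_level_sparse_attention(q, keys, values, block_size, top_blocks, top_tokens):
    """Attend from one query to a small subset of tokens chosen in two stages.

    Assumption (not from the paper): a block is scored by dot(q, mean key).
    Returns (output vector, sorted indices of the selected tokens).
    """
    n = len(keys)

    # Level 1 (coarse): one cheap score per block, keep the best blocks.
    block_scores = []
    for start in range(0, n, block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        block_scores.append((dot(q, mean_key), start))
    kept_blocks = sorted(block_scores, reverse=True)[:top_blocks]

    # Level 2 (fine): exact scores only for tokens inside the kept blocks.
    token_scores = []
    for _, start in kept_blocks:
        for i in range(start, min(start + block_size, n)):
            token_scores.append((dot(q, keys[i]), i))
    kept = sorted(token_scores, reverse=True)[:top_tokens]

    # Scaled softmax over the selected tokens only.
    scale = 1.0 / math.sqrt(len(q))
    peak = max(s for s, _ in kept) * scale  # subtract for numerical stability
    weights = [math.exp(s * scale - peak) for s, _ in kept]
    total = sum(weights)

    out = [0.0] * len(values[0])
    for w, (_, i) in zip(weights, kept):
        for d in range(len(out)):
            out[d] += (w / total) * values[i][d]
    return out, sorted(i for _, i in kept)


# Toy input: 8 tokens in 2 blocks of 4; only the second block aligns with q.
q = [1.0, 0.0]
keys = [[0.0, 1.0]] * 4 + [[5.0, 0.0], [1.0, 0.0], [4.0, 0.0], [0.0, 0.0]]
values = [[float(i)] for i in range(8)]
out, selected = two_level_sparse_attention(q, keys, values, 4, 1, 2)
```

In this toy input the coarse stage keeps only the second block, and the fine stage then keeps the two highest-scoring tokens inside it, so exact scores are computed for 4 tokens instead of 8. Setting `top_blocks` and `top_tokens` large enough to cover every token reduces the sketch to full attention.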