LEXI: Lossless Exponent Coding for Efficient Inter-Chiplet Communication in Hybrid LLMs
2026-03-16 • Hardware Architecture
AI summary
The authors found that a part of the number format used in large language models, called the exponent, is very predictable and can be compressed. They developed a method named LEXI that compresses these exponents without losing any information, which helps move data faster during model processing. This leads to shorter waiting times when the model makes predictions, without affecting accuracy or using much extra power or chip space. Their tests on several modern models show a significant reduction in communication delays and overall processing time.
large language models · bfloat16 · exponent compression · Huffman coding · inference latency · chiplet architecture · network-on-chip · Shannon entropy · lossless compression
Authors
Miao Sun, Alish Kanani, Kaushik Shroff, Umit Ogras
Abstract
Data movement overheads increase the inference latency of state-of-the-art large language models (LLMs). These models commonly use the bfloat16 (BF16) format for stable training. BF16 allocates eight bits to the exponent, but our profiling reveals that exponent streams exhibit fewer than 3 bits of Shannon entropy, indicating high inherent compressibility. To exploit this potential, we propose LEXI, a novel lossless exponent compression scheme based on Huffman coding. LEXI compresses activations and caches on the fly and stores compressed weights for just-in-time decompression near compute, without sacrificing system throughput or model accuracy. The codecs at the ingress and egress ports of network-on-chip routers sustain the maximum link bandwidth via multi-lane LUT decoders, incurring only 0.09 percent area and energy overheads in GF 22 nm technology. LEXI reduces inter-chiplet communication latency by 33-45 percent and end-to-end inference latency by 30-35 percent on modern Jamba, Zamba, and Qwen LLMs implemented on a homogeneous chiplet architecture.
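The core observation behind LEXI is that BF16 exponent fields are highly skewed, so their Shannon entropy sits well below the eight bits the format spends on them, and a Huffman code can approach that entropy losslessly. The sketch below is a minimal illustration of this idea, not the authors' implementation: it synthesizes Gaussian-distributed values as a stand-in for real LLM activations, extracts the BF16 exponent field (identical to the float32 exponent field), measures its empirical entropy, and builds an ordinary Huffman code over the observed exponent symbols.

```python
import collections
import heapq
import math
import random
import struct

random.seed(0)
# Stand-in for LLM activations (real tensors are similarly concentrated).
values = [random.gauss(0.0, 1.0) for _ in range(100_000)]

def bf16_exponent(x: float) -> int:
    """Extract the 8-bit exponent field of x (BF16 shares float32's exponent)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF

counts = collections.Counter(bf16_exponent(v) for v in values)
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

# Build a Huffman code over the observed exponent symbols.
# Heap entries are (frequency, tie-breaker, {symbol: codeword}).
heap = [(c, i, {sym: ""}) for i, (sym, c) in enumerate(counts.items())]
heapq.heapify(heap)
uid = len(heap)
while len(heap) > 1:
    c1, _, t1 = heapq.heappop(heap)
    c2, _, t2 = heapq.heappop(heap)
    merged = {s: "0" + code for s, code in t1.items()}
    merged.update({s: "1" + code for s, code in t2.items()})
    heapq.heappush(heap, (c1 + c2, uid, merged))
    uid += 1
codebook = heap[0][2]

avg_bits = sum(counts[s] * len(codebook[s]) for s in counts) / total
print(f"entropy: {entropy:.2f} bits, Huffman avg: {avg_bits:.2f} bits (vs 8 raw)")
```

On this synthetic data the exponent entropy lands in the low single digits of bits, and the Huffman average code length sits between the entropy and entropy + 1, which is the compression headroom the paper exploits; in LEXI the decoding side is realized as multi-lane LUT decoders at the NoC router ports rather than in software.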