Leech Lattice Vector Quantization for Efficient LLM Compression
2026-03-11 • Machine Learning
AI summary
The authors studied a new way to compress large language models by grouping parameters in blocks instead of one by one. They used a special mathematical structure called the Leech lattice, which is very good at packing points in 24 dimensions. They improved search and conversion methods to make this approach practical without needing lots of extra memory. Their method, called LLVQ, performs better than several recent compression techniques, showing that using high-dimensional lattices can help make model compression both effective and efficient.
scalar quantization · vector quantization · Leech lattice · sphere packing · Golay code · large language models · model compression · codebook · dequantization · angular search
Authors
Tycho F. A. van der Ouderaa, Mart van Baalen, Paul Whatmough, Markus Nagel
Abstract
Scalar quantization of large language models (LLMs) is fundamentally limited by information-theoretic bounds. Vector quantization (VQ) overcomes these limits by encoding blocks of parameters jointly, but practical implementations must avoid expensive lookup mechanisms and other explicit codebook storage. Lattice approaches address this through highly structured, dense packings. This paper explores the Leech lattice, which, with its optimal sphere packing and kissing configurations in 24 dimensions, is the highest-dimensional lattice known to achieve such optimal properties. To make the Leech lattice usable for LLM quantization, we extend an existing search algorithm based on the extended Golay code construction to i) support indexing, enabling conversion to and from bitstrings without materializing the codebook, ii) allow angular search over a union of Leech lattice shells, and iii) provide a fully parallelisable dequantization kernel. Together, these yield a practical algorithm, Leech Lattice Vector Quantization (LLVQ). LLVQ delivers state-of-the-art LLM quantization performance, outperforming recent methods such as QuIP#, QTIP, and PVQ. These results highlight the importance of high-dimensional lattices for scalable, theoretically grounded model compression.
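The abstract's central idea — snapping a whole block of parameters to the nearest point of a highly structured lattice, so that no explicit codebook ever needs to be stored — can be illustrated with a far simpler lattice than the Leech lattice. The sketch below uses the checkerboard lattice D_n (integer vectors with an even coordinate sum) and the classic Conway–Sloane nearest-point rule. It is a toy stand-in to convey the principle only; it is not the paper's Golay-code-based Leech decoder, and the function name `quantize_Dn` is chosen here for illustration.

```python
def quantize_Dn(x):
    """Nearest point of the D_n lattice (integer vectors with even sum).

    Toy illustration of codebook-free lattice VQ: the nearest lattice point
    is found by a closed-form rule, not by searching a stored codebook.
    (Conway-Sloane rule for D_n; NOT the Leech-lattice decoder of LLVQ.)
    """
    # Step 1: round each coordinate to the nearest integer.
    f = [float(round(v)) for v in x]
    # Step 2: if the coordinate sum is even, this is already a D_n point.
    if int(sum(f)) % 2 == 0:
        return f
    # Step 3: otherwise, re-round the coordinate with the largest rounding
    # error in the opposite direction, which restores an even sum at the
    # smallest possible extra cost.
    k = max(range(len(x)), key=lambda i: abs(x[i] - f[i]))
    f[k] += 1.0 if x[k] > f[k] else -1.0
    return f

# Quantize a block of four weights jointly.
print(quantize_Dn([0.6, -1.2, 0.4, 1.1]))  # → [0.0, -1.0, 0.0, 1.1 rounded: 1.0]
```

The same pattern — a fast algebraic nearest-point search replacing codebook lookup — is what the Golay code construction provides for the Leech lattice in 24 dimensions, at much higher packing density than D_n.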