AI summary
The authors study how to shrink large language models by reducing the number of bits used for each model weight, which helps run these models more efficiently on local devices. They note that existing simple methods stop improving accuracy below 3-4 bits per weight, while more complex methods do better but are harder to use. They introduce GSQ, a new scalar quantization technique that cleverly optimizes weight grouping and scaling using a mathematical trick called Gumbel-Softmax. Their method nearly matches the accuracy of complex approaches while being easier to implement and works well even on very large models. This shows that careful tuning of simpler methods can close much of the gap with more complicated ones.
weight quantization, scalar quantization, vector quantization, Gumbel-Softmax, post-training quantization, large language models, bit-width, Mixture-of-Experts models, model compression, inference efficiency
Authors
Alireza Dadgarnia, Soroush Tabesh, Mahdi Nikdan, Michael Helcig, Eldar Kurtic, Dan Alistarh
Abstract
Weight quantization has become a standard tool for efficient LLM deployment, especially for local inference, where models are now routinely served at 2-3 bits per parameter. The state of the art is currently split into two sets of methods: simple scalar quantization techniques, such as GPTQ or AWQ, which are widely deployed but plateau in accuracy at 3-4 bits per parameter (bpp), and "second-generation" vector- or trellis-quantized methods, such as QTIP, GPTVQ and AQLM, which push the accuracy frontier at low bit-widths but are notoriously hard to implement and to scale, and have gained relatively little traction. In this paper, we ask whether this gap is fundamental, or whether a carefully optimized scalar quantizer can recover most of it. We answer in the affirmative, by introducing GSQ (Gumbel-Softmax Quantization), a post-training scalar quantization method which jointly learns the per-coordinate grid assignments and the per-group scales using a Gumbel-Softmax relaxation of the discrete grid. GSQ matches the cardinality of the relaxation to the small number of levels available in the target bit-width regime (e.g., 3 or 8 levels for ternary and 3 bpp, respectively), making the relaxation tight and the optimization tractable. Practically, on the standard Llama-3.1-8B/70B-Instruct models, GSQ closes most of the gap between scalar quantization and the QTIP frontier at 2 and 3 bits, while using a symmetric scalar grid with group-wise quantization, and is thus fully compatible with existing scalar inference kernels. We further show that GSQ scales to trillion-scale Mixture-of-Experts models such as Kimi-K2.5, where vector-quantized methods are difficult to apply.
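The core idea, soft assignment of each weight to one of a few grid levels via a Gumbel-Softmax relaxation, can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-level logit initialization, the fixed per-group scale, and the group size of 8 are assumptions made purely for illustration; in GSQ the logits and per-group scales would be learned jointly by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_assign(logits, tau=1.0):
    """Soft one-hot assignment over grid levels via Gumbel-Softmax.

    logits: (n, K) unnormalized scores assigning each of n weights
    to one of K grid levels; tau: temperature (lower -> more discrete).
    """
    # Sample standard Gumbel noise and apply the softmax at temperature tau.
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-10) + 1e-10)
    y = (logits + g) / tau
    y -= y.max(axis=1, keepdims=True)           # numerical stability
    p = np.exp(y)
    return p / p.sum(axis=1, keepdims=True)     # each row sums to 1

# Hypothetical setup: one group of 8 weights, a symmetric 8-level grid
# (3 bpp), and one scale per group (fixed here; learned in the paper).
w = rng.normal(size=8)
levels = np.linspace(-1.0, 1.0, 8)              # symmetric scalar grid
scale = np.abs(w).max()                         # per-group scale (init)

# Initialize logits to favor the nearest grid point for each weight.
logits = -((w[:, None] - scale * levels[None, :]) ** 2)

probs = gumbel_softmax_assign(logits, tau=0.5)
w_soft = scale * (probs @ levels)               # differentiable surrogate
w_hard = scale * levels[probs.argmax(axis=1)]   # discrete weights at inference
```

Because the grid has only a handful of levels (3-8 in the paper's regime), the categorical distribution being relaxed is small, which is what keeps the relaxation tight and the optimization tractable.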