AI summary
The authors study additive quantization, a method for shrinking large language models for use on small devices, which often fails badly at very low precision (2-bit). They find that the main problem is how the codebook (the set of patterns used for compression) is initially set up: a poor starting point often leads to results that later search and fine-tuning cannot fix. To address this, they propose a new codebook initialization called OA-EM, an output-aware EM procedure that uses a Hessian-weighted Mahalanobis distance to account for the model's output sensitivity. The method gives better results across three models and several compression levels, and the effect of initialization is strongest at the 2-bit level. The work shows that for compressed models, good initialization is crucial for effective fine-tuning.
Additive Quantization, Codebook Initialization, Low-bit Compression, Perplexity, Output-aware EM, Hessian-weighted Mahalanobis Distance, Large Language Models, Fine-tuning, Optimization Geometry
Abstract
Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and fine-tuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
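To make the abstract's terms concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It illustrates three things: lookup-table dequantization as a sum of one codeword per codebook, the ratio ρ, and codeword assignment under a Hessian-weighted distance. All function names are hypothetical. The sketch assumes that ρ = N/KM means N divided by the product K·M, with N weight groups, M codebooks, and K codewords per codebook. It also assumes a small symmetric positive semi-definite matrix `h` standing in for whatever Hessian the method uses, which the abstract does not specify.

```python
from typing import List, Sequence

Vector = List[float]


def decode_group(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct one weight group as the sum of one codeword per codebook."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for m, k in enumerate(codes):      # codes[m] indexes into codebook m
        codeword = codebooks[m][k]     # plain table lookup
        for i in range(dim):
            out[i] += codeword[i]
    return out


def representational_ratio(n_groups: int, k: int, m: int) -> float:
    """rho = N / (K * M); this reading of 'N/KM' is an assumption."""
    return n_groups / (k * m)


def hessian_weighted_distance(w: Vector, c: Vector, h: Sequence[Vector]) -> float:
    """Squared Mahalanobis-style distance (w - c)^T H (w - c)."""
    d = [wi - ci for wi, ci in zip(w, c)]
    n = len(d)
    return sum(d[i] * h[i][j] * d[j] for i in range(n) for j in range(n))


def nearest_codeword(w: Vector, codebook: Sequence[Vector], h: Sequence[Vector]) -> int:
    """Index of the codeword closest to w under the H-weighted distance."""
    return min(range(len(codebook)),
               key=lambda k: hessian_weighted_distance(w, codebook[k], h))


# Toy data: M = 2 codebooks, K = 2 codewords each, group dimension 2.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [-0.5, -0.5]],
]
decoded = decode_group([0, 1], codebooks)  # [1.0, 0.0] + [-0.5, -0.5]

# The same weight vector is assigned differently once the first coordinate
# is weighted as ten times more sensitive than the second.
w = [1.0, 1.0]
candidates = [[1.0, 0.0], [0.5, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
sensitive = [[10.0, 0.0], [0.0, 1.0]]
plain_choice = nearest_codeword(w, candidates, identity)
aware_choice = nearest_codeword(w, candidates, sensitive)
```

In this toy case the unweighted distance picks the second candidate, while the weighted distance picks the first, because an error in the more sensitive coordinate costs more. This is only the assignment step that a clustering or EM-style initialisation might use; the full OA-EM procedure is not reproduced here.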