SPQ: An Ensemble Technique for Large Language Model Compression

2026-02-20

Computation and Language
AI summary

The authors introduce SPQ, a method that shrinks large language models by combining three techniques: pruning redundant neurons, factoring large weight matrices into smaller low-rank pieces (SVD), and compressing weights with 8-bit quantization. Each technique fixes a different inefficiency, and together they work better than any one alone. Tested on the LLaMA-2-7B model, SPQ reduced memory use by up to 75% while maintaining or improving perplexity, a standard measure of how well the model predicts text. Compared to similar methods such as GPTQ and SparseGPT, SPQ uses less memory and runs faster, making it better suited for devices with limited memory.

Large Language Model · Model Compression · Singular Value Decomposition (SVD) · Pruning · Quantization · Perplexity · MLP Layers · Inference Throughput · LLaMA-2-7B
Authors
Jiamin Yao, Eren Gultepe
Abstract
This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD compresses attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines such as GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. SPQ's layer-aware combination of complementary compression techniques may enable practical deployment of LLMs in memory-constrained environments. Code is available at: https://github.com/JiaminYao/SPQ_LLM_Compression/
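
As a concrete illustration, the PyTorch sketch below shows one plausible reading of the three SPQ stages, each applied to a single weight matrix: variance-retained SVD for attention projections, activation-based neuron pruning for MLP layers, and post-training 8-bit quantization for linear layers. The function names, default thresholds, and the symmetric per-tensor quantization scheme are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch

def svd_low_rank(weight: torch.Tensor, variance_kept: float = 0.90):
    """Factor W into A @ B, keeping the smallest rank whose singular
    values retain `variance_kept` of the total spectral energy."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    energy = torch.cumsum(S ** 2, dim=0) / torch.sum(S ** 2)
    rank = int((energy < variance_kept).sum().item()) + 1
    A = U[:, :rank] * S[:rank]   # (out_features, rank)
    B = Vh[:rank, :]             # (rank, in_features)
    return A, B                  # store both factors in place of W

def prune_mlp_neurons(weight: torch.Tensor, activations: torch.Tensor,
                      keep_ratio: float = 0.75):
    """Keep the MLP neurons (rows of `weight`) with the largest mean
    absolute activation over a calibration batch; drop the rest."""
    scores = activations.abs().mean(dim=0)        # one score per neuron
    k = max(1, int(keep_ratio * weight.shape[0]))
    kept = torch.topk(scores, k).indices.sort().values
    return weight[kept], kept                     # pruned weight + index map

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-tensor post-training quantization to int8."""
    scale = weight.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(weight / scale), -128, 127).to(torch.int8)
    return q, scale                               # dequantize: q.float() * scale
```

Under this reading, the division of labor follows the abstract: SVD factorization is applied to attention projection matrices, pruning to MLP weights (scored on activations recorded from a small calibration set), and the quantizer to all remaining linear layers. Storing the factors A and B pays off whenever rank x (out + in) < out x in, which is why low-rank factorization suits the large, dense attention projections.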