Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

2026-06-02 · Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors aim to make large language model inference more efficient by storing some of the model's weights at 2-bit precision, which usually reduces accuracy. They introduce a mixed-precision approach that compresses only specific layers (the gate and up projections of the MLP) to 2-bit while keeping the others at higher precision, improving speed while confining the accuracy loss to those layers. To recover the accuracy lost in these highly compressed layers, they use a method called Recover-LoRA, which trains small adapters using synthetic data. Their experiments show that this approach recovers much of the lost accuracy without needing real labeled data, making it useful for deploying large models on devices with limited memory.

weight quantization, 2-bit precision, mixed-precision, MLP layers, Recover-LoRA, low-rank adaptation, logit distillation, synthetic data, large language models, post-quantization recovery
Authors
Devleena Das, Rajeev Patwari, Elliott Delaye, Ashish Sirasao
Abstract
Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA, a lightweight, data-free accuracy recovery method originally developed for general model weight corruption, to the setting of ultra-low-bit quantization. We propose a selective mixed-precision strategy in which only gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision, yielding a mixed-precision GateUp configuration. We demonstrate via roofline analysis across three model families (4B to 20B) and two hardware platforms that a W4/W2-GateUp deployment (4-bit base with 2-bit gate/up) delivers a 7.5% to 23.3% TPS improvement over uniform W4 depending on model and context length, while confining quantization error to a predictable subset of layers. We then apply Recover-LoRA (training low-rank adapters on the quantized layers via logit distillation with synthetic data) to recover accuracy lost from 2-bit quantization of the gate and up layers. In a case study on Qwen3-4B, Recover-LoRA achieves 80% to 95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data. We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks. Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.
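
To make the roofline argument concrete, the following is a minimal back-of-the-envelope sketch, not the authors' analysis. It assumes that single-stream decoding is purely memory-bandwidth-bound, so tokens per second is roughly bandwidth divided by the bytes read per generated token (linear-layer weights plus KV cache). The `ModelShape` values, the 100 GB/s bandwidth, and the 16-bit KV cache are hypothetical placeholders; embeddings, quantization scales and zero-points, and compute limits are ignored, so the printed gains are illustrative and are not expected to match the figures reported in the abstract.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class ModelShape:
    """Hypothetical decoder-only transformer shape (illustrative values only)."""
    n_layers: int
    d_model: int
    d_ff: int        # MLP intermediate size
    n_q_heads: int
    n_kv_heads: int
    head_dim: int


def linear_params(m: ModelShape) -> tuple[int, int]:
    """Return (gate+up parameters, all other linear parameters) summed over layers."""
    gate_up = 2 * m.d_model * m.d_ff                       # gate and up projections
    down = m.d_ff * m.d_model                              # down projection
    q_and_o = 2 * m.d_model * m.n_q_heads * m.head_dim     # query and output projections
    k_and_v = 2 * m.d_model * m.n_kv_heads * m.head_dim    # key and value projections
    return m.n_layers * gate_up, m.n_layers * (down + q_and_o + k_and_v)


def bytes_per_token(m: ModelShape, base_bits: int, gate_up_bits: int,
                    context_len: int, kv_bits: int = 16) -> float:
    """Bytes read per decoded token: linear weights plus the KV cache."""
    gate_up, other = linear_params(m)
    weight_bytes = (gate_up * gate_up_bits + other * base_bits) / 8
    # Keys and values for every layer, KV head, and cached position.
    kv_bytes = 2 * m.n_layers * m.n_kv_heads * m.head_dim * context_len * kv_bits / 8
    return weight_bytes + kv_bytes


def decode_tps(m: ModelShape, bandwidth_gb_s: float, base_bits: int,
               gate_up_bits: int, context_len: int) -> float:
    """Memory-bound estimate of tokens per second."""
    return bandwidth_gb_s * 1e9 / bytes_per_token(m, base_bits, gate_up_bits, context_len)


def gateup_speedup(m: ModelShape, bandwidth_gb_s: float, context_len: int) -> float:
    """Relative TPS gain of W4 base with 2-bit gate/up over uniform W4."""
    w4 = decode_tps(m, bandwidth_gb_s, 4, 4, context_len)
    mixed = decode_tps(m, bandwidth_gb_s, 4, 2, context_len)
    return mixed / w4 - 1.0


# Assumed shape and bandwidth, chosen only for illustration.
shape = ModelShape(n_layers=32, d_model=4096, d_ff=12288,
                   n_q_heads=32, n_kv_heads=8, head_dim=128)
for ctx in (512, 8192, 65536):
    gain = gateup_speedup(shape, 100.0, ctx)
    print(f"context={ctx:>6}: estimated TPS gain = {gain:.1%}")
```

Under these assumptions the estimated gain shrinks as the context grows, because KV cache traffic is unaffected by weight quantization and takes up a larger share of the bytes read per token. This is one plausible reading of why the reported improvement depends on context length. In this simplified model the relative gain does not depend on the bandwidth value, only the absolute TPS does.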