STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models
2026-06-03 • Machine Learning
AI summary
The authors study how to make diffusion large language models (DLLMs) cheaper to run by representing weights and activations with fewer bits without losing much accuracy. They identify two main problems: masked and unmasked tokens produce differently distributed activations within each denoising step, and quantization errors accumulate across steps during iterative decoding. To address these, they propose STaR-Quant, which applies different activation transformations depending on token state and corrects the quantized attention representation. They report better low-bit accuracy than existing post-training quantization baselines, with up to 1.69x speedup and 3.14x memory saving over FP16 deployment.
Diffusion large language models; Post-training quantization; Masked denoising; Activation distribution; Quantization error; Iterative decoding; State-Guided Activation Transformation; Temporal Attention Compensation; FP16 deployment; Memory efficiency
Authors
Xin Yan, Aqiang Wang, Zhenglin Wan, Xingrui Yu, and Ivor Tsang
Abstract
Diffusion large language models (DLLMs) have recently emerged as a promising alternative to autoregressive LLMs by generating text through iterative masked denoising with bidirectional context. However, their large model sizes and iterative denoising process introduce substantial memory and computational overhead, motivating post-training quantization (PTQ) for efficient deployment. In this paper, we identify two key challenges for low-bit DLLM quantization: state-dependent activation disparity and temporal error accumulation. Masked and unmasked tokens exhibit different activation distributions within each denoising step, while quantization errors can accumulate across steps during iterative decoding. To address these challenges, we propose STaR-Quant, a state-time consistent PTQ framework for DLLMs. STaR-Quant introduces State-Guided Activation Transformation (SGAT) to assign masked and unmasked tokens to different activation transformation spaces with a unified static weight-side transformation. It further introduces Temporal Attention Compensation (TAC) to correct the quantized attention representation via a lightweight block-diagonal affine mapping. Experiments on representative DLLMs demonstrate that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines, while delivering up to 1.69x speedup and 3.14x memory saving over FP16 deployment.
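To make the idea of a "block-diagonal affine mapping" concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not say how TAC's mapping is obtained or where exactly it is applied, so everything here is an assumption for illustration: the helper names (`fake_quantize`, `fit_block_affine`, `apply_block_affine`), the block size, the symmetric 4-bit fake quantizer, and the choice to fit each block by ordinary least squares against full-precision outputs on synthetic calibration data.

```python
import numpy as np


def fake_quantize(x, bits=4):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale


def fit_block_affine(x_q, x_fp, block_size):
    """Fit one (W, b) per channel block so that x_q[:, blk] @ W + b ~ x_fp[:, blk].

    Assumption: least squares on calibration data; the paper may fit it differently.
    """
    n, d = x_q.shape
    assert d % block_size == 0, "channel count must be divisible by block_size"
    ones = np.ones((n, 1))
    params = []
    for start in range(0, d, block_size):
        blk = slice(start, start + block_size)
        design = np.hstack([x_q[:, blk], ones])  # last column models the bias
        sol, *_ = np.linalg.lstsq(design, x_fp[:, blk], rcond=None)
        params.append((sol[:-1], sol[-1]))  # W: (bs, bs), b: (bs,)
    return params


def apply_block_affine(x_q, params, block_size):
    """Apply the per-block affine maps; channels never mix across blocks."""
    out = np.empty_like(x_q)
    for i, (w, b) in enumerate(params):
        blk = slice(i * block_size, (i + 1) * block_size)
        out[:, blk] = x_q[:, blk] @ w + b
    return out


# Synthetic stand-in for attention outputs: 256 tokens, 16 channels
# with uneven per-channel ranges.
rng = np.random.default_rng(0)
x_fp = rng.normal(size=(256, 16)) * rng.uniform(0.5, 4.0, size=16)
x_q = fake_quantize(x_fp, bits=4)

params = fit_block_affine(x_q, x_fp, block_size=4)
x_c = apply_block_affine(x_q, params, block_size=4)

err_before = np.mean((x_q - x_fp) ** 2)
err_after = np.mean((x_c - x_fp) ** 2)
print(f"MSE before: {err_before:.5f}, after: {err_after:.5f}")
```

Because the identity map with zero bias is one of the candidates the least-squares fit can choose, the compensated error on the calibration data cannot exceed the uncompensated error. A block-diagonal map with block size `bs` over `d` channels stores `d * (bs + 1)` parameters instead of the `d * (d + 1)` of a dense affine map, which is one plausible reading of "lightweight" in the abstract.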