KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

2026-06-02

Machine Learning
AI summary

The authors study long-sequence generation in large language models, where the memory needed to store past keys and values (the KV-cache) grows with sequence length and becomes a bottleneck. They find that quantization, which compresses this cache, introduces errors that accumulate across timesteps when the model generates one token at a time, mainly because token scales come out wrong. To address this, they propose KVarN, which applies a Hadamard rotation and then normalizes variance along both axes of the K and V matrices, reducing these errors without any calibration. The method improves results on generative benchmarks including MATH500, AIME24 and HumanEval at 2-bit precision, and a vLLM implementation is publicly available.

large language model, test-time scaling, KV-cache, quantization, autoregressive decoding, Hadamard rotation, variance normalization, token scale, 2-bit precision, long-horizon decoding
Authors
Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli
Abstract
Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding as the KV-cache grows. KV-cache quantization can relieve this bottleneck, but current methods are evaluated under prefill-like settings, and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation relative to existing baselines. KVarN establishes a new state-of-the-art for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN
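The pipeline named in the abstract (Hadamard rotation, then normalization along both axes, then low-bit quantization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the alternating row/column RMS rescaling, the iteration count, and the per-row min/max uniform quantizer are all assumptions chosen for illustration; the abstract does not specify these details.

```python
import numpy as np

def hadamard(n):
    """Orthonormal n x n Hadamard matrix via Sylvester construction (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def dual_scale_normalize(X, iters=4, eps=1e-8):
    """Assumed form of dual-axis normalization: alternately rescale rows
    (tokens) and columns (channels) to unit RMS. Returns Xn and the
    accumulated scales r, c so that X == r[:, None] * Xn * c[None, :]."""
    r = np.ones(X.shape[0])
    c = np.ones(X.shape[1])
    Xn = X.astype(float)
    for _ in range(iters):
        rs = np.sqrt(np.mean(Xn**2, axis=1)) + eps
        Xn = Xn / rs[:, None]
        r *= rs
        cs = np.sqrt(np.mean(Xn**2, axis=0)) + eps
        Xn = Xn / cs[None, :]
        c *= cs
    return Xn, r, c

def quantize_uniform(X, bits=2):
    """Hypothetical per-row min/max uniform quantizer (bits <= 8)."""
    levels = 2**bits - 1
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    step = np.maximum(hi - lo, 1e-8) / levels
    codes = np.clip(np.round((X - lo) / step), 0, levels).astype(np.uint8)
    return codes, lo, step

def dequantize_uniform(codes, lo, step):
    return codes * step + lo

def roundtrip(K, bits=2):
    """Rotate, normalize, quantize, then invert every step."""
    H = hadamard(K.shape[1])
    Xn, r, c = dual_scale_normalize(K @ H)
    codes, lo, step = quantize_uniform(Xn, bits)
    X_hat = dequantize_uniform(codes, lo, step)
    return (r[:, None] * X_hat * c[None, :]) @ H.T

rng = np.random.default_rng(0)
K = rng.standard_normal((16, 64))   # 16 tokens, head dimension 64
K[3] *= 50.0                        # one token with an outlying scale
K_hat = roundtrip(K, bits=2)
rel_err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"relative 2-bit round-trip error: {rel_err:.3f}")
```

In this sketch the row scales `r` play the role of per-token scales and are kept in full precision, so a token with an unusually large norm does not widen the quantization grid used for the normalized values. Treating rows as tokens is an assumption about the tensor layout.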