ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

2026-06-09 | Artificial Intelligence

AI summary

The authors address an inference bottleneck in reasoning-oriented large language models: long chains of reasoning steps make the key-value (KV) cache grow quickly, which slows decoding. They propose ReasonAlloc, a training-free method that divides a fixed cache budget unevenly, setting per-layer budgets in advance and shifting budget between attention heads during decoding. On the mathematical reasoning benchmarks MATH-500 and AIME 2024, it outperforms R-KV, SnapKV, and Pyramid-RKV, especially when the memory budget is small (128-512 tokens). ReasonAlloc plugs into existing token-eviction policies and adds negligible inference-time overhead.

large language models, chain-of-thought reasoning, key-value cache, token eviction, budget allocation, autoregressive decoding, mathematical reasoning benchmarks, inference bottleneck, layer-wise compression, head-wise resource allocation
Authors
Wenhao Liu, Hao Shi, Yunhe Li, Weizhi Fei, Xiangyuan Wang, Mengzhe Ruan, Hanxu Hou, Peisong Wang, Linqi Song, Shuang Qiu
Abstract
Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value (KV) cache growth. Current decoding-time compression methods mitigate this issue via token eviction, but typically assume a uniform budget distribution across all layers and heads. In contrast, existing non-uniform budget allocation methods are predominantly designed for the static prompt prefill phase, and they do not capture the stepwise context demands of autoregressive reasoning. To bridge this gap, we propose ReasonAlloc, a training-free framework that recasts decoding-time KV compression as a hierarchical budget allocation problem. ReasonAlloc operates at two complementary levels: an offline layer-wise preallocation strategy captures an architecture-driven demand pattern which we call "Reasoning Wave", while an online head-wise strategy reallocates resources during decoding to information-rich heads based on real-time utility. Evaluations on mathematical reasoning benchmarks (MATH-500, AIME 2024) using DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-14B, and AceReason-14B show that ReasonAlloc outperforms uniform-budget R-KV, SnapKV, and Pyramid-RKV (a baseline enforcing a static, monotonically decreasing layer budget), with the largest gains at small budgets (128-512 tokens). ReasonAlloc is plug-and-play with existing token-eviction policies and introduces negligible inference-time overhead.
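
To make the two-level idea concrete, the following is a minimal Python sketch of a hierarchical budget split. It is not the authors' implementation: the abstract does not specify how the layer-wise "Reasoning Wave" profile or the head-wise utility is computed, so `layer_weights` and `head_utilities` are hypothetical inputs, the proportional rule with a per-head floor is an assumption, and `total_budget` is assumed to count cached KV entries summed over all layers and heads.

```python
from typing import List


def proportional_split(total: int, weights: List[float], floor: int = 1) -> List[int]:
    """Split an integer `total` in proportion to `weights`.

    Every item gets at least `floor`, and the result sums exactly to `total`.
    Rounding is done on cumulative shares so no tokens are lost or invented.
    """
    n = len(weights)
    if n == 0 or total < floor * n:
        raise ValueError("total is too small to give every item the floor")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    weight_sum = sum(weights)
    if weight_sum == 0:
        # No signal at all: fall back to a uniform split.
        weights, weight_sum = [1.0] * n, float(n)
    spare = total - floor * n
    alloc, cum, prev_edge = [], 0.0, 0
    for i, w in enumerate(weights):
        cum += w
        # The last edge is pinned to `spare` so the shares sum exactly.
        edge = spare if i == n - 1 else round(spare * cum / weight_sum)
        alloc.append(floor + edge - prev_edge)
        prev_edge = edge
    return alloc


def hierarchical_budgets(
    total_budget: int,
    layer_weights: List[float],         # hypothetical offline per-layer demand profile
    head_utilities: List[List[float]],  # hypothetical online per-head utility scores
    min_per_head: int = 1,
) -> List[List[int]]:
    """Return budgets[l][h], the number of KV entries head h of layer l may keep."""
    if len(layer_weights) != len(head_utilities):
        raise ValueError("need one utility list per layer")
    num_heads = len(head_utilities[0])
    if any(len(u) != num_heads for u in head_utilities):
        raise ValueError("this sketch assumes the same head count in every layer")
    # Level 1 (offline in the paper's framing): fix each layer's share once.
    layer_budgets = proportional_split(
        total_budget, layer_weights, floor=min_per_head * num_heads
    )
    # Level 2 (online): redistribute each layer's share across its heads.
    return [
        proportional_split(budget, utilities, floor=min_per_head)
        for budget, utilities in zip(layer_budgets, head_utilities)
    ]


if __name__ == "__main__":
    budgets = hierarchical_budgets(
        total_budget=20,
        layer_weights=[1.0, 3.0],
        head_utilities=[[1.0, 1.0], [3.0, 1.0]],
    )
    print(budgets)
```

In a decoding loop, only the second level would be recomputed as head utilities change, while the layer shares stay fixed. The resulting per-head budgets would then be handed to whatever token-eviction policy is in use.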