Robust Length Prediction: A Perspective from Heavy-Tailed Prompt-Conditioned Distributions
2026-04-09 • Machine Learning
AI summary
The authors explain that predicting the length of a language model's output from the input prompt alone is difficult because the same prompt can yield widely varying lengths rather than a single fixed number. They find that output lengths are consistent with a heavy-tailed distribution: very long outputs are rare, but they do occur. To handle this, the authors propose methods called ProD that use multiple generations from the same prompt to build better training targets, either a robust single value (the median) or the whole distribution of possible lengths. Their experiments show that these methods improve length prediction over previous approaches.
Large Language Models, Output Length Prediction, Heavy-Tailed Distribution, Prompt Conditioning, Robust Estimation, Batching, Decoding, Hidden States, Median-based Prediction, Distributional Prediction
Authors
Jing Wang, Yu-Yang Qian, Ke Xue, Chao Qian, Peng Zhao, Zhi-Hua Zhou
Abstract
Output-length prediction is important for efficient LLM serving, as it directly affects batching, memory reservation, and scheduling. For prompt-only length prediction, most existing methods use a one-shot sampled length as the label, implicitly treating each prompt as if it had one true target length. We show that this is unreliable: even under a fixed model and decoding setup, the same prompt induces a prompt-conditioned output length distribution, not a deterministic scalar, and this distribution is consistent with heavy-tailed behavior. Motivated by this, we cast length prediction as robust estimation from heavy-tailed prompt-conditioned length distributions. We propose prompt-conditioned length distribution (ProD) methods, which construct training targets from multiple independent generations of the same prompt. Two variants are developed to reuse the served LLM's hidden states: ProD-M, which uses a median-based target for robust point prediction, and ProD-D, which uses a distributional target that preserves prompt-conditioned uncertainty. We provide theoretical justifications by analyzing the estimation error under a surrogate model. Experiments across diverse scenarios show consistent gains in prediction quality.
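To make the two kinds of training target concrete, the following is a minimal sketch of how a median-based label and a distributional label could be built from several sampled lengths of one prompt. It is an illustration under stated assumptions, not the authors' implementation: the function names, the sample lengths, and the bin edges are all hypothetical, and the abstract does not specify how ProD-D represents its distributional target (a normalized histogram over fixed bins is assumed here).

```python
import numpy as np

def median_target(lengths):
    """Robust point target: sample median of repeated generation lengths."""
    return float(np.median(np.asarray(lengths, dtype=float)))

def distribution_target(lengths, bin_edges):
    """Distributional target: normalized histogram over fixed length bins.

    The binning scheme is an assumption made for this sketch.
    """
    lengths = np.asarray(lengths, dtype=float)
    # Clip so that lengths outside the bin range land in the first or last bin.
    clipped = np.clip(lengths, bin_edges[0], bin_edges[-1])
    counts, _ = np.histogram(clipped, bins=bin_edges)
    return counts / counts.sum()

# Hypothetical output lengths (in tokens) from 8 independent generations
# of the same prompt; one generation is an extreme outlier.
sampled_lengths = [120, 135, 128, 141, 2048, 131, 126, 138]

# Hypothetical bin edges for the distributional target.
bin_edges = np.array([0, 64, 128, 256, 512, 1024, 4096])

mean_label = float(np.mean(sampled_lengths))      # pulled up by the outlier
median_label = median_target(sampled_lengths)     # stays near the typical length
dist_label = distribution_target(sampled_lengths, bin_edges)

print(f"mean label:   {mean_label:.3f}")
print(f"median label: {median_label:.1f}")
print(f"distribution label: {dist_label}")
```

In this toy sample the single 2048-token generation moves the mean far from the typical length while the median stays close to it, which is the intuition behind preferring a median-based target under heavy tails. The histogram keeps the rare long generation as explicit probability mass in the last bin instead of discarding it.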