Dual-Pool Token-Budget Routing for Cost-Efficient and Reliable LLM Serving

2026-04-09

Computation and Language
AI summary

The authors found that current setups for running large language models (LLMs) often waste computing power because each server is provisioned for the longest possible context even though 80-95% of requests are short. To fix this, they split a homogeneous fleet into two pools, one tuned for short contexts and one for long contexts, and route each request by its estimated total token budget, which is derived from a learned bytes-to-token ratio rather than from running a tokenizer. On real-world traces, this approach cuts GPU-hours by 31-42%, lowers preemption rates by 5.4×, and improves P99 time to first token (TTFT) by 6%. They also provide a simple analytical model to predict savings before deployment. The method composes with other optimizations such as PagedAttention and continuous batching, and adapts to varying workloads.

vLLM fleets, KV-cache, token budget, short-context pool, long-context pool, tokenizer, GPU throughput, OOM crashes, prefill-decode disaggregation, PagedAttention
Authors
Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Abstract
Production vLLM fleets typically provision each instance for the worst-case context length, leading to substantial KV-cache over-allocation and under-utilized concurrency. In practice, 80-95% of requests are short, yet are served under configurations optimized for long contexts, wasting 4-8$\times$ throughput capacity and triggering reliability issues such as OOM crashes, preemption, and request rejections. We identify a common root cause for these inefficiencies: configuration-traffic mismatch. We propose dual-pool token-budget routing, a lightweight dispatch mechanism that partitions a homogeneous fleet into two specialized pools: a high-throughput short-context pool and a high-capacity long-context pool. Each request is routed based on its estimated total token budget, computed using a per-category bytes-to-token ratio that is learned online via exponential moving average from usage.prompt_tokens feedback, eliminating the need for a tokenizer. We also develop a simple analytical model that predicts fleet-level cost savings from workload characteristics and measured throughput differences, enabling practitioners to estimate benefits prior to deployment. Evaluations on real-world traces from the Azure LLM Inference Dataset and LMSYS-Chat-1M, serving Llama-3-70B on A100 GPUs, show that our approach reduces GPU-hours by 31-42%, corresponding to \$2.86M annual savings at fleet scale, while lowering preemption rates by 5.4$\times$ and improving P99 TTFT by 6%. A case study with Qwen3-235B-A22B on AMD MI300X at 10,000 req/s projects \$15.4M in annual savings. The method incurs only O(1) dispatch overhead, adapts automatically to heterogeneous workloads, and composes seamlessly with existing optimizations such as PagedAttention, continuous batching, and prefill-decode disaggregation.
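To make the dispatch mechanism concrete, the following is a minimal sketch of tokenizer-free routing with a per-category bytes-to-token ratio updated by exponential moving average. It is an illustration under stated assumptions, not the authors' implementation: the class name `TokenBudgetRouter`, the pool threshold, the smoothing factor, the initial ratio, and the choice to define the total budget as estimated prompt tokens plus the requested maximum output tokens are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudgetRouter:
    """Hypothetical dual-pool router; all names and defaults are assumptions."""

    short_pool_limit: int = 8192   # assumed max total tokens the short pool accepts
    alpha: float = 0.1             # assumed EMA smoothing factor
    default_ratio: float = 4.0     # assumed starting bytes-per-token guess
    ratios: dict = field(default_factory=dict)  # category -> bytes per token

    def estimate_budget(self, category: str, prompt: str, max_output_tokens: int) -> int:
        """Estimate total tokens from the prompt's byte length, without a tokenizer."""
        ratio = self.ratios.get(category, self.default_ratio)
        prompt_bytes = len(prompt.encode("utf-8"))
        est_prompt_tokens = int(prompt_bytes / ratio) + 1  # slight over-estimate
        return est_prompt_tokens + max_output_tokens

    def route(self, category: str, prompt: str, max_output_tokens: int) -> str:
        """Constant-time dispatch: one dict lookup and one comparison."""
        budget = self.estimate_budget(category, prompt, max_output_tokens)
        return "short" if budget <= self.short_pool_limit else "long"

    def observe(self, category: str, prompt: str, prompt_tokens: int) -> None:
        """Fold the server-reported usage.prompt_tokens into the category's EMA."""
        if prompt_tokens <= 0:
            return  # ignore malformed feedback
        observed = len(prompt.encode("utf-8")) / prompt_tokens
        prev = self.ratios.get(category, self.default_ratio)
        self.ratios[category] = (1 - self.alpha) * prev + self.alpha * observed
```

In this sketch, a caller would invoke `route` before dispatch and `observe` once the response's `usage.prompt_tokens` arrives, so the ratio for each category drifts toward what the serving engine actually reports. Rounding the prompt estimate up is a design choice here: under-estimating risks sending a long request to the short pool, which is the costlier mistake.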