Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
2026-04-08 • Distributed, Parallel, and Cluster Computing
Distributed, Parallel, and Cluster ComputingMachine LearningOperating SystemsPerformanceSoftware Engineering
AI summaryⓘ
The authors present Blink, a new system for running large language models (LLMs) that avoids using the main CPU for most work during inference. Instead, Blink uses a SmartNIC and GPU to manage requests and handle processing directly, which reduces delays and improves efficiency. Their method outperforms current popular systems, running faster and using less energy per token. Importantly, Blink keeps its performance steady even when the CPU is busy with other tasks, unlike existing approaches that slow down significantly.
Large Language ModelInferenceSmartNICGPURDMABatchingKV-cacheToken-level ControlDatacenterThroughput
Authors
Mohammad Siavashi, Mariano Scazzariello, Gerald Q. Maguire, Dejan Kostić, Marco Chiesa
Abstract
Large Language Model (LLM) inference is rapidly becoming a core datacenter service, yet current serving stacks keep the host CPU on the critical path for orchestration and token-level control. This makes LLM performance sensitive to CPU interference, undermining application colocation and forcing operators to reserve CPU headroom, leaving substantial capacity unutilized. We introduce Blink, an end-to-end serving architecture that removes the host CPU from the steady-state inference path by redistributing responsibilities across a SmartNIC and a GPU. Blink offloads request handling to the SmartNIC, which delivers inputs directly into GPU memory via RDMA, and replaces host-driven scheduling with a persistent GPU kernel that performs batching, scheduling, and KV-cache management without CPU involvement. Evaluated against TensorRT-LLM, vLLM, and SGLang, Blink outperforms all baselines even in isolation, reducing pre-saturation P99 TTFT by up to 8.47$\times$ and P99 TPOT by up to 3.40$\times$, improving decode throughput by up to 2.1$\times$, and reducing energy per token by up to 48.6$\%$. Under CPU interference, Blink maintains stable performance, while existing systems degrade by up to two orders of magnitude.