GNStor: Design of GPU-Native High-Performance Remote All-Flash Array
2026-06-03 • Operating Systems
AI summary
The authors present GNStor, a system that lets GPUs access remote all-flash arrays (AFAs) directly, with no CPU involvement in the I/O path. Its NVMe over RDMA software stack, GNoR, follows the GPU's SIMT execution model and uses atomic operations to orchestrate I/O, so GPU threads can issue requests straight to the SSDs in the remote array. A second component, deEngine, decomposes AFA-level tasks such as access control and metadata persistence and integrates them into the firmware of each SSD, replacing the centralized CPU-side engine. The authors report 3.2× higher I/O throughput and a 31.1% reduction in application execution time compared to state-of-the-art AFA systems.
GPU, All-flash array (AFA), NVMe over RDMA (NoR), I/O performance, SIMT, SSD firmware, CPU bypass, Parallel computing, Storage system, Data-intensive applications
Authors
Shushu Yi, Wenbo Wu, Guoci Chen, Junrong Zhu, Shengwen Liang, Mao Bo, Chenying Huan, Chen Tian, Jie Zhang
Abstract
The GPU has become the leading computing device for a wide range of data-intensive applications, and it collaborates tightly with remote all-flash arrays (AFAs) to accommodate ever-expanding datasets, facilitate multi-client data sharing, and guarantee fault tolerance. Although the GPU is the center of computation, all I/O processing in existing GPU-AFA systems is still CPU-centric: the CPU orchestrates remote I/O requests and executes a centralized AFA engine that takes charge of AFA-level functionalities (e.g., access control and metadata persistence). This mismatch incurs substantial CPU-GPU interaction overhead and I/O traffic amplification, compromising end-to-end I/O performance. In this work, we present GNStor, a GPU-native AFA system that enables the GPU to directly access a remote AFA without CPU intervention in the I/O path, thereby fully exploiting the performance of the AFA. Specifically, GNStor first proposes a GPU-centric NVMe over RDMA (NoR) software stack, named GNoR, which paves a fast path for GPUs to directly initiate NoR I/O requests to SSDs within the remote AFA. GNoR employs an atomic-operation-based I/O orchestration design and follows the single-instruction-multiple-thread (SIMT) execution model of the GPU, fully exploiting the massive parallelism of GPU architectures. To provide essential AFA functionalities in a CPU-bypass I/O path, GNStor further designs deEngine, a decentralized AFA engine that decomposes AFA-level tasks and integrates them into the firmware of each SSD, thereby achieving efficient AFA access at low cost. Evaluation results show that GNStor achieves 3.2× higher I/O throughput and reduces application execution time by 31.1%, compared to state-of-the-art AFA systems.
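The abstract does not spell out how GNoR's atomic-operation-based orchestration works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that many threads share one submission queue and that each thread claims a distinct slot with a single atomic fetch-add instead of taking a lock. Host threads stand in for GPU threads, and `Command`, `SubmissionQueue`, `submit`, and `run_demo` are hypothetical names introduced for illustration. Queue wrap-around, doorbells, RDMA, and warp-level cooperation under SIMT are not modeled.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical command layout (a real NVMe submission entry is 64 bytes).
struct Command {
    std::uint32_t thread_id;  // which simulated GPU thread issued the request
    std::uint64_t lba;        // logical block address to read
};

// Hypothetical submission queue shared by all threads.
struct SubmissionQueue {
    explicit SubmissionQueue(std::size_t depth) : slots(depth), ready(depth) {
        for (auto& flag : ready) flag.store(0, std::memory_order_relaxed);
    }
    std::vector<Command> slots;                    // command payloads
    std::vector<std::atomic<std::uint8_t>> ready;  // 1 once a slot is published
    std::atomic<std::uint64_t> tail{0};            // next free slot index
};

// Claim one slot with a single atomic fetch-add, fill it, then publish it.
// Assumption: the queue never overflows, so wrap-around is not handled.
std::uint64_t submit(SubmissionQueue& q, std::uint32_t tid, std::uint64_t lba) {
    std::uint64_t ticket = q.tail.fetch_add(1, std::memory_order_relaxed);
    assert(ticket < q.slots.size() && "sketch does not handle wrap-around");
    q.slots[ticket] = Command{tid, lba};
    // Release ordering makes the payload visible before the ready flag.
    q.ready[ticket].store(1, std::memory_order_release);
    return ticket;
}

// Launch n host threads, each submitting one request, and count how many
// distinct threads ended up with a published slot of their own.
std::size_t run_demo(std::size_t n) {
    SubmissionQueue q(n);
    std::vector<std::thread> workers;
    for (std::uint32_t t = 0; t < n; ++t)
        workers.emplace_back([&q, t] { submit(q, t, 8ull * t); });
    for (auto& w : workers) w.join();

    std::vector<bool> seen(n, false);
    std::size_t published = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (q.ready[i].load(std::memory_order_acquire) != 1) continue;
        std::uint32_t tid = q.slots[i].thread_id;
        if (!seen[tid]) {
            seen[tid] = true;
            ++published;
        }
    }
    return published;
}
```

Calling `run_demo(64)` from a `main` function returns 64: every thread receives a unique ticket from the fetch-add, so no two requests land in the same slot even though no lock is taken. A lock-free claim of this kind is one plausible reason an atomic-based design suits thousands of concurrent GPU threads, but the paper itself should be consulted for the actual mechanism.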