CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

2026-02-27 · Machine Learning · Artificial Intelligence
AI summary

The authors created CUDA Agent, a system that uses reinforcement learning to improve how GPU code (CUDA kernels) is written for deep learning workloads. Unlike existing methods, their system learns to optimize CUDA code more effectively by combining a scalable data pipeline, tooling for testing and profiling generated code, and stable training techniques. The approach outperforms compiler-based tools and leading proprietary models, especially on the hardest benchmarks, making GPU kernel optimization less reliant on expert knowledge.

CUDA · GPU kernel · reinforcement learning · deep learning · torch.compile · code optimization · profiling · compiler · large language models · KernelBench
Authors
Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
Abstract
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but neither paradigm fundamentally improves the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning techniques that enable stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering faster-than-torch.compile rates of 100%, 100%, and 92% on the KernelBench Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
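The abstract's "automated verification and profiling to provide reliable reward signals" can be sketched as a simple verify-then-time loop: a candidate kernel earns zero reward if its output diverges from the reference, and otherwise earns its measured speedup over the baseline. This is a minimal, GPU-free illustration under assumed names (`speedup_reward`, `timed` are hypothetical), not the authors' implementation:

```python
import time


def timed(fn, x):
    """Wall-clock time for a single call to fn(x), in seconds."""
    start = time.perf_counter()
    fn(x)
    return time.perf_counter() - start


def speedup_reward(candidate, baseline, make_input, n_trials=5, atol=1e-4):
    """Hypothetical reward: 0.0 if the candidate is incorrect,
    otherwise its speedup over the baseline (best-of-n timings)."""
    x = make_input()
    # Verification: candidate output must match the reference within tolerance.
    if abs(candidate(x) - baseline(x)) > atol:
        return 0.0
    # Profiling: take the best of n trials to damp timing noise.
    t_base = min(timed(baseline, x) for _ in range(n_trials))
    t_cand = min(timed(candidate, x) for _ in range(n_trials))
    return t_base / t_cand  # reward > 1.0 means faster than the baseline
```

A real system would replace the scalar comparison with tensor-level checks across many random inputs and use CUDA-event or profiler timings, but the reward shape (correctness gate, then speedup) is the same.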