MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

2026-06-03 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
AI summary

The authors developed MusaCoder, a training framework that teaches large language models to generate efficient native GPU kernels for tensor computations on CUDA and MUSA backends. It combines kernel-oriented data synthesis and rejection fine-tuning with reinforcement learning driven by MooreEval, a distributed verifier that executes the generated code and returns rewards. Three techniques (PrimeEcho, Buffered Dynamic Retry, and MirrorPop) are introduced to stabilize RL training. On KernelBench and a MUSA-ported variant, the 9B model matches or exceeds frontier closed-source models and the 27B model sets a new state of the art in both correctness and speedup. The results also indicate that Moore Threads GPUs can support the complete LLM post-training stack.

GPU kernel generation, Large Language Models (LLMs), CUDA, MUSA backend, reinforcement learning, MooreEval, execution feedback, KernelBench, Moore Threads GPUs, code synthesis
Authors
Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang
Abstract
Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.
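To make the idea of an execution-feedback reward concrete, the following is a minimal Python sketch of how a verifier might turn one kernel run into a scalar reward. It is not MooreEval's actual reward: the `ExecutionResult` fields, the `execution_reward` function, the 0.5 correctness floor, and the `speedup_cap` parameter are all hypothetical choices made for illustration. The only ideas taken from the abstract are that rewards come from executing the generated kernel and that both correctness and speedup matter.

```python
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Hypothetical record of one verifier run (not MooreEval's schema)."""
    compiled: bool       # did the generated kernel compile?
    correct: bool        # did its outputs match the reference within tolerance?
    baseline_ms: float   # runtime of the reference implementation
    kernel_ms: float     # runtime of the generated kernel


def execution_reward(result: ExecutionResult, speedup_cap: float = 4.0) -> float:
    """Map an execution result to a reward in [0, 1].

    Assumed shaping: failures earn 0, a correct kernel earns at least 0.5,
    and the remaining 0.5 scales linearly with speedup up to `speedup_cap`.
    """
    if not (result.compiled and result.correct):
        return 0.0
    # Reject non-positive timings, which would otherwise yield an
    # unbounded "speedup" (one simple guard against reward hacking).
    if result.kernel_ms <= 0.0 or result.baseline_ms <= 0.0:
        return 0.0
    speedup = result.baseline_ms / result.kernel_ms
    return 0.5 + 0.5 * min(speedup, speedup_cap) / speedup_cap


# Usage: a correct kernel that runs twice as fast as the baseline.
reward = execution_reward(ExecutionResult(True, True, baseline_ms=2.0, kernel_ms=1.0))
```

Under this assumed shaping, any kernel that fails to compile or produces wrong outputs receives zero, which illustrates why rewards are sparse on hard problems: when every sampled kernel fails, the batch carries no learning signal.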