AccompGen: Hierarchical Autoregressive Vocal Accompaniment Generation with Dual-Rate Codec Tokenization
2026-04-10 • Sound
Sound, Multimedia
AI summary
The authors developed AccompGen, a system that generates instrumental accompaniment for a given vocal track. It represents vocals and instrumentals as separate token streams at different frame rates (HuBERT semantic tokens for vocals, EnCodec acoustic tokens for instrumentals) that stay aligned in time. Generation proceeds in three autoregressive stages, from semantic tokens to coarse and then fine acoustic tokens, and the Transformer uses QK-norm, GEGLU activations, RMSNorm, and relative position bias to improve training stability. The output can be mixed directly with the isolated vocals to form a complete song.
instrumental accompaniment, vocals, HuBERT tokens, EnCodec tokens, autoregressive model, Transformer, classifier-free guidance, semantic tokens, acoustic tokens, GEGLU activations
Authors
Jian Zhu, Jianwei Cui, Shihao Chen, Yubang Zhang, Cheng Luo
Abstract
We present AccompGen, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, AccompGen produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic to coarse acoustic to fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization.
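To make the dual-rate idea concrete, the sketch below works through the frame arithmetic implied by the 50 Hz and 75 Hz rates, along with common textbook forms of codebook interleaving and classifier-free guidance. Only the two frame rates come from the abstract; the function names, the frame-by-frame interleaving order, and the guidance formula are illustrative assumptions and may differ from what AccompGen actually does.

```python
from fractions import Fraction

VOCAL_RATE_HZ = 50   # HuBERT semantic token rate (stated in the abstract)
INSTR_RATE_HZ = 75   # EnCodec acoustic token rate (stated in the abstract)


def num_frames(duration_s, rate_hz):
    """Number of whole token frames that fit in duration_s seconds."""
    return int(Fraction(duration_s) * rate_hz)


def aligned_vocal_frame(instr_frame):
    """Vocal (50 Hz) frame active at the start of a given instrumental (75 Hz) frame.

    Hypothetical helper: instr_frame / 75 is a time in seconds, and multiplying
    by 50 gives the vocal frame position; integer floor division keeps it exact.
    """
    return instr_frame * VOCAL_RATE_HZ // INSTR_RATE_HZ


def interleave_codebooks(codebooks):
    """Flatten K parallel codebook streams (shape [K][T]) frame by frame.

    Assumed ordering: all K tokens of frame 0, then all K tokens of frame 1, ...
    """
    return [tok for frame in zip(*codebooks) for tok in frame]


def cfg_logits(cond, uncond, scale):
    """Standard classifier-free guidance mix: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    # A 2-second clip yields 100 vocal frames and 150 instrumental frames (ratio 2:3).
    print(num_frames(2, VOCAL_RATE_HZ), num_frames(2, INSTR_RATE_HZ))
    # Every 3 instrumental frames span exactly 2 vocal frames.
    print([aligned_vocal_frame(i) for i in range(6)])
    print(interleave_codebooks([[1, 2, 3], [10, 20, 30]]))
    print(cfg_logits([1.0, 0.0], [0.0, 0.0], 3.0))
```

Because 50:75 reduces to 2:3, the two streams realign exactly every 40 ms, so a time-based mapping of this kind is lossless at those boundaries even though the token sequences have different lengths.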