SemanticVocoder: Bridging Audio Generation and Audio Understanding via Semantic Latents

2026-02-26 | Sound

AI summary

The authors observe that existing audio generation models usually generate within the latent space of a Variational Autoencoder (VAE). VAE latents capture low-level acoustic detail well, but they leave the meanings of different sound events entangled, which makes the generative model harder to train. To address this, they propose SemanticVocoder, a generative vocoder that synthesizes waveforms directly from the latents of a semantic encoder instead of VAE acoustic latents. Because the semantic latents are more discriminative, the resulting text-to-audio model reports a Frechet Distance of 12.823 and a Frechet Audio Distance of 1.709 on the AudioCaps test set. The authors also present the shared semantic space as a step toward unifying audio understanding and audio generation in one system.

Variational Autoencoder, latent space, semantic encoding, generative vocoder, text-to-audio generation, Frechet Distance, audio synthesis, AudioCaps dataset, feature disentanglement
Authors
Zeyu Xie, Chenxing Li, Qiao Jin, Xuenan Xu, Guanrou Yang, Wenfu Wang, Mengyue Wu, Dong Yu, Yuexian Zou
Abstract
Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents and introduce semantic encoder latents, thereby proposing SemanticVocoder, a generative vocoder that directly synthesizes waveforms from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Frechet Distance of 12.823 and a Frechet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation performance, it also serves as a promising attempt towards unifying audio understanding and generation within a shared semantic space. Generated samples are available at https://zeyuxie29.github.io/SemanticVocoder/.
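For readers unfamiliar with the reported metrics, the following is a minimal sketch of how a Frechet distance between two sets of embedding vectors is commonly computed: fit a Gaussian to each set and compare means and covariances. The abstract does not state which embedding model or evaluation toolkit was used, so the random features, the function name `frechet_distance`, and the feature dimension below are all assumptions made for illustration, not the authors' evaluation code.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (n, d) feature sets.

    FD = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^{1/2})
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of C_a @ C_b, which are real and non-negative for
    # covariance matrices (clip guards against tiny negative round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

# Hypothetical stand-ins for embeddings of reference and generated audio.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
generated = rng.normal(loc=0.5, size=(500, 8))

fd_same = frechet_distance(reference, reference)   # identical sets
fd_shift = frechet_distance(reference, generated)  # shifted mean
```

Lower values indicate that the generated distribution is closer to the reference distribution. In practice the features would come from a pretrained audio embedding model rather than a random generator.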