Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

2026-02-18

Sound, Computation and Language
AI summary

The authors study how to build large audio foundation models that operate directly on audio, rather than on text or semantic-only audio representations. They compare training choices, including how much text to mix into the training data and which audio token composition to use, to establish a validated training recipe. They also examine how performance scales with model size and data, finding that the optimal amount of training data grows faster than the optimal model size. Applying these lessons, they train SODA, a suite of models that handle both audio and text tasks, and show it can be fine-tuned for tasks such as translating speech while preserving the speaker's voice. Overall, the work advances audio models that both understand and generate sound across diverse tasks.

Keywords
audio foundation models, next-token prediction, discrete audio tokens, scaling laws, training data mixture, IsoFLOP analysis, SODA models, speech-to-speech translation, cross-modal learning, model scaling
Authors
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held, Diyi Yang
Abstract
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
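To make the scaling-law claim concrete, below is a minimal sketch (not the authors' code) of how an IsoFLOP analysis can be reduced to power-law fits. The compute budgets match the range quoted in the abstract, but the model-size and token counts are illustrative placeholders chosen only so that the fitted data exponent comes out roughly 1.6 times the model-size exponent; the paper's actual measurements are not reproduced here.

```python
# Minimal sketch, assuming hypothetical IsoFLOP minima (compute budget C,
# loss-optimal parameter count N_opt, loss-optimal token count D_opt).
# Fit power laws N_opt ~ C^a and D_opt ~ C^b; the paper reports that the
# optimal data grows about 1.6x faster than the optimal model size (b/a ~ 1.6).
import numpy as np

# Compute budgets spanning 3e18 to 3e20 FLOPs, as in the abstract.
C = np.array([3e18, 1e19, 3e19, 1e20, 3e20])
# Placeholder optima (NOT the paper's measured values), consistent with C ~ 6*N*D.
N_opt = np.array([1.3e8, 2.1e8, 3.2e8, 5.0e8, 7.7e8])
D_opt = np.array([3.8e9, 8.0e9, 1.6e10, 3.3e10, 6.5e10])

# A power law is linear in log-log space: log N_opt = a * log C + const.
a, _ = np.polyfit(np.log(C), np.log(N_opt), 1)
b, _ = np.polyfit(np.log(C), np.log(D_opt), 1)

print(f"model-size exponent a = {a:.2f}")
print(f"data exponent       b = {b:.2f}")
print(f"optimal data grows {b / a:.2f}x faster than optimal model size")
```

Under these placeholder numbers the script prints a ratio near 1.6, mirroring the reported trend; with real IsoFLOP minima the same two log-log fits would recover the paper's exponents.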