Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

2026-03-09

Sound, Artificial Intelligence, Machine Learning
AI summary

The authors study how autoregressive language models, usually applied to text, can compress audio files without losing any sound detail. They test these models across diverse audio domains (music, speech, bioacoustics) and fidelity levels, with bit depths up to 24-bit. They find that standard sample-level tokenization becomes intractable for high-fidelity audio because the vocabulary grows exponentially with bit depth, so they propose a new byte-level approach called Trilobyte that handles high-resolution audio efficiently. Their results show these models beat standard compressors like FLAC at 8-bit and 16-bit depths, but the advantage shrinks as bit depth increases.

autoregressive language models, lossless audio compression, bit depth, tokenization, Trilobyte, FLAC, sampling rate, raw waveforms, vocabulary size
Authors
Phillip Long, Zachary Novack, Chris Donahue
Abstract
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchmark LM-based compression on full-fidelity audio across diverse domains (music, speech, bioacoustics), sampling rates (16kHz-48kHz), and bit depths (8, 16, 24-bit). Standard sample-level tokenization becomes intractable at higher bit depths due to vocabulary size (65K for 16-bit; 16.7M for 24-bit). We propose Trilobyte, a byte-level tokenization schema for full resolution audio, improving vocabulary scaling from $O(2^{b})$ to $O(1)$ and enabling the first tractable 24-bit LM-based lossless compression. While LMs consistently outperform FLAC and yield state-of-the-art compression at 8-bit and 16-bit, we observe that compression gains become more modest as bit depth increases beyond 8-bit.
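The abstract's vocabulary-scaling argument can be illustrated concretely. Trilobyte's exact schema is not detailed on this page, so the sketch below shows only the general byte-level idea it names: splitting each PCM sample into individual bytes keeps the token vocabulary fixed at 256 regardless of bit depth, whereas sample-level tokenization needs $2^{b}$ tokens for $b$-bit audio ($2^{16} = 65{,}536$; $2^{24} = 16{,}777{,}216$). Function names and the big-endian byte order are illustrative assumptions, not the paper's implementation.

```python
def to_byte_tokens(sample: int, bit_depth: int = 24) -> list[int]:
    """Split one unsigned PCM sample into big-endian bytes, each a token in 0..255.
    (Illustrative sketch of byte-level tokenization; not the paper's exact schema.)"""
    n_bytes = bit_depth // 8
    return [(sample >> (8 * (n_bytes - 1 - i))) & 0xFF for i in range(n_bytes)]

def from_byte_tokens(tokens: list[int]) -> int:
    """Reassemble the original sample losslessly from its byte tokens."""
    sample = 0
    for t in tokens:
        sample = (sample << 8) | t
    return sample

# Sample-level tokenization needs a 2**24 = 16,777,216-entry vocabulary for
# 24-bit audio; byte-level tokenization needs only 256 token types at any depth.
sample = 0xABCDEF                      # one 24-bit sample
tokens = to_byte_tokens(sample)        # three byte tokens, each in 0..255
assert tokens == [0xAB, 0xCD, 0xEF]
assert from_byte_tokens(tokens) == sample   # round trip is lossless
```

The round trip is exact, which is what makes a byte-level scheme compatible with lossless compression: the LM only predicts a distribution over 256 symbols per step, at the cost of a sequence three times longer for 24-bit audio.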