AudioGS: Spectrogram-Based Audio Gaussian Splatting for Sound Field Reconstruction

2026-04-10

Sound
AI summary

The authors developed AudioGS, a method for synthesizing spatial (binaural) audio without any visual input. Instead of relying on images or scene geometry inferred from video, AudioGS represents the sound field explicitly as a set of 'Audio Gaussians', one per time-frequency bin of a spectrogram. Each Gaussian models how sound varies with direction and distance, so binaural audio can be rendered for a new listener pose. On the Replay-NVAS dataset, AudioGS outperforms previous methods that depend on visual cues, with lower magnitude reconstruction error and a lower perceptual distance score.

spatial audio, binaural audio, 3D Gaussian Splatting, spectrogram, Spherical Harmonics, directionality, distance attenuation, phase correction, waveform reconstruction, Replay-NVAS dataset
Authors
Chunhao Bi, Houqiang Zhong, Zhixin Xu, Li Song, Zhengxue Cheng
Abstract
Spatial audio is fundamental to immersive virtual experiences, yet synthesizing high-fidelity binaural audio from sparse observations remains a significant challenge. Existing methods typically rely on implicit neural representations conditioned on visual priors, which often struggle to capture fine-grained acoustic structures. Inspired by 3D Gaussian Splatting (3DGS), we introduce AudioGS, a novel visual-free framework that explicitly encodes the sound field as a set of Audio Gaussians based on spectrograms. AudioGS associates each time-frequency bin with an Audio Gaussian equipped with dual Spherical Harmonic (SH) coefficients and a decay coefficient. For a target pose, we render binaural audio by evaluating the SH field to capture directionality, incorporating geometry-guided distance attenuation and phase correction, and reconstructing the waveform. Experiments on the Replay-NVAS dataset demonstrate that AudioGS successfully captures complex spatial cues and outperforms state-of-the-art visual-dependent baselines. Specifically, AudioGS reduces the magnitude reconstruction error (MAG) by over 14% and lowers the perceptual metric (DPAM) score by approximately 25% compared to the best-performing visual-guided method.
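The rendering steps named in the abstract (SH evaluation for directionality, distance attenuation, phase correction) can be illustrated for a single time-frequency bin. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes degree-1 real spherical harmonics, reads "dual" SH coefficients as one set per ear, models attenuation as exponential decay times 1/distance, and models phase correction as a pure propagation delay. The names `real_sh_deg1`, `render_bin`, and `SPEED_OF_SOUND` are hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not taken from the paper


def real_sh_deg1(direction):
    """Real spherical harmonics up to degree 1 for a unit vector (x, y, z)."""
    x, y, z = direction
    c0 = 0.5 * math.sqrt(1.0 / math.pi)      # Y_0^0 (isotropic term)
    c1 = math.sqrt(3.0 / (4.0 * math.pi))    # degree-1 normalisation
    return [c0, c1 * y, c1 * z, c1 * x]


def render_bin(source_bin, sh_coeffs, decay, source_pos, ear_pos, freq_hz):
    """Render one complex spectrogram bin at one ear (illustrative only).

    source_bin: complex STFT value stored with the Audio Gaussian
    sh_coeffs:  4 SH coefficients for this ear (assumed degree 1)
    decay:      per-Gaussian decay coefficient (assumed exponential form)
    """
    offset = [e - s for e, s in zip(ear_pos, source_pos)]
    dist = max(math.sqrt(sum(d * d for d in offset)), 1e-6)
    direction = [d / dist for d in offset]

    # Directionality: weighted sum of the SH basis along the emission direction.
    gain = sum(c * b for c, b in zip(sh_coeffs, real_sh_deg1(direction)))
    # Distance attenuation: assumed exp(-decay * d) / d.
    attenuation = math.exp(-decay * dist) / dist
    # Phase correction: assumed propagation delay of dist / c seconds.
    phase = cmath.exp(-2j * math.pi * freq_hz * dist / SPEED_OF_SOUND)
    return source_bin * gain * attenuation * phase


# Usage: an isotropic Gaussian (gain 1 in every direction) heard at two ears.
isotropic = [2.0 * math.sqrt(math.pi), 0.0, 0.0, 0.0]
left = render_bin(1 + 0j, isotropic, 0.1, (0, 0, 0), (2.0, 0.1, 0), 1000.0)
right = render_bin(1 + 0j, isotropic, 0.1, (0, 0, 0), (2.0, -0.1, 0), 1000.0)
```

Repeating this for every bin and each ear would give two complex spectrograms, from which the binaural waveform could be recovered with an inverse STFT. The actual attenuation and phase models in AudioGS may differ from the simple forms assumed here.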