Flow-HOA: Generative Joint Optimization for Ambisonics Encoding via Flow Matching

2026-06-03

Sound
AI summary

The authors present Flow-HOA, a method for encoding Higher-Order Ambisonics (3D audio) from the sparse, irregular microphone arrays common in consumer devices. The approach uses conditional flow matching, a generative modeling technique, to produce a fixed bank of FIR encoding filters, and it is trained with a loss that targets time-domain, spectral, and spatial fidelity. In objective tests on simulated data, Flow-HOA outperforms model-based baselines in signal fidelity and spatial accuracy; in listening tests on real microphone array recordings, it is rated higher in overall sound quality with fewer artifacts. This suggests the method can improve spatial audio capture for immersive communication and XR applications.

Higher-Order Ambisonics, sparse microphone arrays, finite impulse response filters, conditional flow matching, time-domain fidelity, spectral consistency, spatial audio, immersive communication, XR (extended reality), generative framework
Authors
Yuhuan You, Yufan Qian, Tianshu Qu, Bin Wang, Xueyang Lv
Abstract
Higher-Order Ambisonics (HOA) encoding from sparse, irregular microphone arrays remains a critical challenge for consumer spatial audio capture in immersive communication and XR. We propose Flow-HOA, a generative framework that jointly optimizes a multi-dimensional objective encompassing time-domain, spectral, and spatial fidelity while producing a deployable, time-invariant bank of Finite Impulse Response (FIR) encoding filters. Using conditional flow matching, the model learns to map a simple prior distribution to the target distribution of FIR filter coefficients. Training is guided by a composite loss that balances time-domain waveform fidelity, multi-resolution spectral consistency, sub-band energy preservation, and spatial directivity constraints. Objective evaluations on synthetically simulated data demonstrate improved performance over strong model-based baselines in both signal fidelity and spatial accuracy metrics. Subjective listening tests on real microphone array recordings further confirm that Flow-HOA yields higher overall sound quality with reduced artifacts, demonstrating generalization from synthetic training data to real-world capture conditions.
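
The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`encode_hoa`, `flow_matching_pair`), the array shapes, and the straight-line probability path between a standard Gaussian prior and the target coefficients are all assumptions made for illustration; the abstract does not specify the path, the network, or the filter length.

```python
import numpy as np

def encode_hoa(mic_signals, fir_bank):
    """Apply a time-invariant FIR encoding filter bank (hypothetical layout).

    mic_signals: (M, T) array, one row per microphone.
    fir_bank:    (C, M, L) array, one length-L filter per (HOA channel, mic) pair.
    Returns a (C, T + L - 1) array of HOA channel signals.
    """
    n_ch, n_mics, n_taps = fir_bank.shape
    assert mic_signals.shape[0] == n_mics
    out = np.zeros((n_ch, mic_signals.shape[1] + n_taps - 1))
    for c in range(n_ch):
        for m in range(n_mics):
            # Each HOA channel is the sum of the filtered microphone signals.
            out[c] += np.convolve(mic_signals[m], fir_bank[c, m])
    return out

def flow_matching_pair(x1, rng):
    """Build one training pair for conditional flow matching.

    Assumes a straight-line path x_t = (1 - t) * x0 + t * x1 from a
    standard Gaussian prior sample x0 to the target coefficients x1.
    A velocity model would be regressed onto v_target at (x_t, t).
    """
    x0 = rng.standard_normal(x1.shape)  # sample from the simple prior
    t = rng.uniform()                   # time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1        # point on the path
    v_target = x1 - x0                  # constant velocity along the path
    return xt, t, v_target

# Usage: 4 microphones, 9 HOA channels (second order), 64-tap filters.
rng = np.random.default_rng(0)
mics = rng.standard_normal((4, 480))
bank = 0.1 * rng.standard_normal((9, 4, 64))
hoa = encode_hoa(mics, bank)            # shape (9, 543)
xt, t, v_target = flow_matching_pair(bank, rng)
```

Because the filters are fixed after generation, encoding at run time is plain multichannel convolution, which is what makes the filter bank deployable on device. The composite loss described in the abstract is not sketched here because its terms and weights are not given.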