PHAST-Net: Attention-Guided, Physics-Informed Network for Unified Estimation of Ideal Time-Frequency Representations

2026-06-22Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition
AI summary

The authors present PHAST-Net, a new network that uses attention and physics ideas to create clear and detailed time-frequency views of sound, such as speech and music. It builds on a special type of wavelet transform called CLAWT, designed to capture harmonic signal features effectively. Their method encourages consistency and energy balance during training, making the results more stable and accurate. PHAST-Net can also isolate fundamental frequencies, useful for analyzing pitch patterns, and represents signal features smoothly for flexible use. Tested on a large synthetic dataset, it outperforms existing techniques in analyzing complex sounds.

Time-frequency representationWavelet transformAttention mechanismCohen's class kernelSpectrogramTempogramMetrogramHarmonic analysisSpline trajectoryPhysics-informed learning
Authors
James M. Cozens, Simon J. Godsill
Abstract
We introduce PHAST-Net, an attention-guided, physics-informed network for unified estimation of Ideal Time-Frequency Representations (ITFRs), spanning spectral, tempo-based, metrical, and harmonic representations such as Spectrograms, Tempograms, and Metrograms. PHAST-Net learns an application-general mapping from a constellation of wavelet transforms, the proposed Continuous Log-frequency Adaptive Wavelet Transform (CLAWT), to high-resolution, cross-term-suppressed time-frequency (T-F) representations. The proposed constellation of CLAWTs is selected through Cohen's class kernel analysis to maximise curvature coverage in a logarithmic-frequency T-F plane tailored to harmonic signal structure. PHAST-Net further incorporates a proposed physics-informed auxiliary reprojection loss designed to reconstruct the idealised observed CLAWT constellation from the predicted ITFR and the corresponding Cohen's class kernels during training. This auxiliary objective promotes transform consistency and energy conservation, mitigates pathological target sparsity, and enhances optimisation stability. Attention layers further promote effective cross-term suppression across the input constellation. The log-frequency formulation also enables Harmonic PHAST-Net, which estimates a Harmonic ITFR that isolates fundamental structure, supporting robust fundamental-only representations for speech and music, such as derived fundamental Tempograms and Metrograms. We further introduce Spline-PHAST-Net, which parameterises detected and associated T-F ridges as continuous spline trajectories, enabling arbitrary-grid re-rendering and signal reconstruction. Trained on an effectively unbounded procedurally generated dataset, PHAST-Net demonstrates improved accuracy over established approaches, providing a unified framework for high-resolution, cross-term-robust analysis of speech, music, and broader nonstationary signals.