Representation Matters in Randomized Smoothing for Audio Classification

2026-06-02 · Machine Learning

Machine Learning, Sound
AI summary

The authors explain that when randomized smoothing (RS) is used to certify how robust audio classifiers are against noise, it matters a great deal where and how the noise is added. Because audio usually passes through steps such as normalization and conversion to features, "adding noise to the input" does not by itself say whether the waveform or the features are perturbed. They tested different smoothing approaches on two audio tasks and found that results vary with the audio representation and the processing steps. Their work shows that a robustness guarantee is meaningful only when it clearly states which part of the audio pipeline is being certified and how the noise is applied.

Randomized Smoothing, Audio Classification, Waveform, Log-mel Features, Noise Robustness, Signal-to-Noise Ratio, Normalization, Certified Radius, Perturbation Model, Smoothing
Authors
Jong-Ik Park, Shreyas Chaudhari, José M. F. Moura, Carlee Joe-Wong
Abstract
Randomized smoothing (RS) certifies robustness in the vector space where Gaussian noise is added. In audio classification, this space is often not uniquely defined, because standard pipelines normalize, range-control, and transform waveforms into log-mel or other spectral features. We show that direct RS is therefore under-specified unless the certified object and preprocessing policy are explicit. On two audio benchmarks, keyword spotting and environmental-sound classification, we study waveform, feature-space, and post-processed smoothing. Our diagnostics show why representation-aware reporting is necessary: at the same smoothing level $\sigma=0.0025$, the two datasets share the same median raw radius $0.007996$, but different waveform energies yield different SNR-equivalent scales ($83.98$ vs. $90.97$ dB); log-mel smoothing gives higher positive-radius certified accuracy on environmental sounds ($68.42\%$ vs. $65.53\%$), certifying more examples with nonzero radius but over features rather than waveforms; and clipping or peak normalization changes the effective perturbation norm by roughly $230$--$351\times$. We therefore recommend that audio RS studies choose and report the task-specific certified object and perturbation model, including the perturbation location, gain policy, raw radius, and any post-noise geometry changes.
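
The three diagnostics in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' code, and everything below is an assumption made for illustration: the standard Gaussian RS radius $\sigma\,\Phi^{-1}(\underline{p_A})$, a one-sided Clopper-Pearson bound with hypothetical settings `n=10_000` and `alpha=0.001` (the abstract states neither), an SNR-equivalent defined as $20\log_{10}(\lVert x\rVert_2 / r)$, and synthetic sine waves in place of real audio. Under those hypothetical settings the radius of a fully confident example saturates at roughly $0.008$, which would be consistent with two datasets sharing the same median raw radius, although the abstract does not confirm this explanation.

```python
import math
import random
from statistics import NormalDist


def certified_radius(sigma, p_lower):
    """L2 radius sigma * Phi^{-1}(p_lower); 0.0 means the example is not certified."""
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


def saturated_p_lower(n, alpha):
    """One-sided Clopper-Pearson lower bound when all n noisy votes agree."""
    return alpha ** (1.0 / n)


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def snr_equivalent_db(x, radius):
    """ASSUMED definition (not from the source): 20 * log10(||x||_2 / radius)."""
    return 20.0 * math.log10(l2_norm(x) / radius)


def peak_normalize(x):
    """Rescale so the largest absolute sample equals 1.0."""
    peak = max(abs(v) for v in x)
    return [v / peak for v in x]


def sine(amplitude, n=16000, freq=440.0, rate=16000.0):
    """Synthetic stand-in for a one-second waveform."""
    return [amplitude * math.sin(2 * math.pi * freq * t / rate) for t in range(n)]


sigma = 0.0025
# Hypothetical certification settings; the abstract does not give n or alpha.
radius_cap = certified_radius(sigma, saturated_p_lower(n=10_000, alpha=0.001))

# Same raw radius, different waveform energy -> different SNR-equivalent scale.
quiet, loud = sine(0.1), sine(0.2)
snr_quiet = snr_equivalent_db(quiet, radius_cap)
snr_loud = snr_equivalent_db(loud, radius_cap)

# Post-noise peak normalization rescales the perturbation seen downstream.
rng = random.Random(0)
noise = [rng.gauss(0.0, sigma) for _ in quiet]
noisy = [a + b for a, b in zip(quiet, noise)]
post_diff = [a - b for a, b in zip(peak_normalize(noisy), peak_normalize(quiet))]
norm_ratio = l2_norm(post_diff) / l2_norm(noise)

print(f"saturated raw radius: {radius_cap:.6f}")
print(f"SNR-equivalent (quiet, loud): {snr_quiet:.2f} dB, {snr_loud:.2f} dB")
print(f"perturbation norm ratio after peak normalization: {norm_ratio:.1f}x")
```

In this toy setting, doubling the amplitude raises the SNR-equivalent by about 6 dB even though the raw radius is unchanged, and peak-normalizing a quiet waveform after adding noise inflates the perturbation norm well beyond $\lVert\delta\rVert_2$. The magnitudes depend entirely on the synthetic inputs and should not be compared with the figures reported in the abstract.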