Diffusion-Guided Feature Selection via Nishimori Temperature: Noise-Based Spectral Embedding

2026-04-27 · Machine Learning
AI summary

The authors introduce Noise-Based Spectral Embedding (NBSE), a method for picking important features from large datasets without slow step-by-step searches. NBSE builds a special graph connecting samples and finds a key temperature at which a matrix changes behaviour, helping identify crucial data patterns while avoiding bias towards popular data points. By looking at features in a new way, NBSE groups similar or redundant features and selects one from each group to simplify the data. Tests on image-recognition models show that NBSE keeps accuracy high even with far fewer features, beating some traditional methods. The authors also show that the method is stable against noise in the data.

Spectral embedding, Feature selection, Bethe Hessian, Nishimori temperature, Similarity graph, Eigenvector, Degree correction, Diffusion process, Gaussian noise, ImageNet embeddings
Authors
Vasiliy S. Usatyuk, Denis A. Sapozhnikov, Sergey I. Egorov
Abstract
We propose Noise-Based Spectral Embedding (NBSE), a physics-informed framework for selecting informative features from high-dimensional data without greedy search. NBSE constructs a sparse similarity graph on the samples and identifies the Nishimori temperature $\beta_N$, the critical inverse temperature at which the Bethe Hessian becomes singular. The eigenvector associated with its smallest eigenvalue captures the dominant mode of an intrinsically degree-corrected diffusion process, naturally reweighting nodes to prevent hub dominance. By transposing the data matrix and applying NBSE in feature space, we obtain a one-dimensional spectral embedding that reveals groups of redundant or semantically related dimensions; balanced binning then selects one representative per group. We prove that coloured Gaussian perturbations shift $\beta_N$ by at most $O(\bar\sigma^2)$, guaranteeing robustness to measurement noise. Experiments on ImageNet embeddings from MobileNetV2 and EfficientNet-B4 show that NBSE preserves classification accuracy even under aggressive compression: on EfficientNet-B4 the accuracy drop is below $1\%$ when retaining only $30\%$ of the features, outperforming the ANOVA $F$-test and random selection by up to $6.8\%$.
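The two core steps of the abstract, locating the Nishimori temperature $\beta_N$ as the point where the Bethe Hessian's smallest eigenvalue crosses zero, then balanced binning of the resulting one-dimensional embedding, can be sketched in NumPy. This is a minimal illustration under our own assumptions: the helper names (`bethe_hessian`, `nishimori_temperature`, `select_representatives`), the particular weighted Bethe Hessian parameterisation built from `tanh(beta * w)`, and the bisection search are ours, not the authors' implementation.

```python
import numpy as np

def bethe_hessian(W, beta):
    # Weighted Bethe Hessian built from a symmetric, zero-diagonal weight
    # matrix W (one common parameterisation; an assumption, not the paper's code):
    #   H(beta)_ij = delta_ij * (1 + sum_k t_ik^2 / (1 - t_ik^2))
    #                - t_ij / (1 - t_ij^2),   with t_ij = tanh(beta * W_ij)
    t = np.tanh(beta * W)
    off = t / (1.0 - t**2 + 1e-12)                     # off-diagonal coupling
    np.fill_diagonal(off, 0.0)
    diag = 1.0 + (t**2 / (1.0 - t**2 + 1e-12)).sum(axis=1)
    return np.diag(diag) - off

def nishimori_temperature(W, beta_lo=1e-3, beta_hi=5.0, iters=40):
    # Bisect for the beta where the smallest eigenvalue of H(beta) crosses
    # zero; [beta_lo, beta_hi] is assumed to bracket the transition
    # (lambda_min > 0 at beta_lo, < 0 at beta_hi).
    lam_min = lambda b: np.linalg.eigvalsh(bethe_hessian(W, b))[0]
    for _ in range(iters):
        mid = 0.5 * (beta_lo + beta_hi)
        if lam_min(mid) > 0:
            beta_lo = mid        # still below the transition
        else:
            beta_hi = mid        # past it
    return 0.5 * (beta_lo + beta_hi)

def select_representatives(embedding, k):
    # Balanced binning: split the sorted 1-D embedding into k equal-size
    # bins and keep one index per bin (here the bin's median entry; the
    # tie-breaking rule is an illustrative choice).
    order = np.argsort(embedding)
    return [int(b[len(b) // 2]) for b in np.array_split(order, k)]
```

On a sparse similarity graph with community structure, the smallest eigenvalue of `bethe_hessian(W, beta)` is positive near `beta = 0` (the matrix is close to the identity) and turns negative past the transition, so the bisection pins down the singular point described in the abstract; the eigenvector at that $\beta_N$ then serves as the embedding fed to `select_representatives`.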