The elbow statistic: Multiscale clustering statistical significance

2026-03-03

Machine Learning
AI summary

The authors address the challenge of choosing how many clusters to use when grouping data without labels. They developed ElbowSig, a method that turns the common 'elbow' technique into a formal statistical test, checking if clusters found are meaningfully different from random data. Their approach works with different types of clustering methods and can detect multiple levels of structure in data. Tests on simulated and real datasets show ElbowSig controls false positives well while revealing complex patterns missed by simpler methods.

unsupervised learning, cluster number selection, elbow method, statistical inference, heterogeneity sequence, null distribution, Type-I error, multiscale structure, hard clustering, fuzzy clustering
Authors
Francisco J. Perez-Reche
Abstract
Selecting the number of clusters remains a fundamental challenge in unsupervised learning. Existing criteria typically target a single "optimal" partition, often overlooking statistically meaningful structure present at multiple resolutions. We introduce ElbowSig, a framework that formalizes the heuristic "elbow" method as a rigorous inferential problem. Our approach centers on a normalized discrete curvature statistic derived from the cluster heterogeneity sequence, which is evaluated against a null distribution of unstructured data. We derive the asymptotic properties of this null statistic in both large-sample and high-dimensional regimes, characterizing its baseline behavior and stochastic variability. As an algorithm-agnostic procedure, ElbowSig requires only the heterogeneity sequence and is compatible with a wide range of clustering methods, including hard, fuzzy, and model-based clustering. Extensive experiments on synthetic and empirical datasets demonstrate that the method maintains appropriate Type-I error control while providing the power to resolve multiscale organizational structures that are typically obscured by single-resolution selection criteria.
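
The abstract does not give the exact form of the curvature statistic or of the null model, so the following Python sketch only illustrates the general idea under stated assumptions and should not be read as the ElbowSig definition. It assumes that the heterogeneity sequence is a list h with h[0] the heterogeneity for one cluster, h[1] for two clusters, and so on (for example, within-cluster sums of squares); that the curvature at k is the second difference of h divided by h[0]; and that heterogeneity sequences obtained from unstructured reference data are supplied by the caller. The function names discrete_curvature and elbow_p_values are hypothetical.

```python
from typing import List, Sequence


def discrete_curvature(h: Sequence[float]) -> List[float]:
    """Second difference of a heterogeneity sequence, divided by h[0].

    h[i] is the heterogeneity obtained with i + 1 clusters.
    Entry j of the result corresponds to k = j + 2 clusters.
    ASSUMPTION: normalizing by h[0] is an illustrative choice only;
    the paper's normalization is not stated in the abstract.
    """
    if len(h) < 3:
        raise ValueError("need heterogeneity values for at least 3 cluster counts")
    if h[0] <= 0:
        raise ValueError("h[0] must be positive to normalize by it")
    return [(h[i - 1] - 2.0 * h[i] + h[i + 1]) / h[0]
            for i in range(1, len(h) - 1)]


def elbow_p_values(observed: Sequence[float],
                   null_sequences: Sequence[Sequence[float]]) -> List[float]:
    """Monte Carlo p-value for the curvature at each k.

    null_sequences holds heterogeneity sequences computed on unstructured
    reference data (ASSUMPTION: generated by the caller, e.g. by clustering
    samples drawn from a single-component distribution).
    A large observed curvature relative to the null gives a small p-value.
    """
    if not null_sequences:
        raise ValueError("need at least one null sequence")
    if any(len(s) != len(observed) for s in null_sequences):
        raise ValueError("null sequences must match the observed length")
    obs = discrete_curvature(observed)
    null = [discrete_curvature(s) for s in null_sequences]
    n_null = len(null)
    p_values = []
    for j, c in enumerate(obs):
        # Count null curvatures at this k that are at least as large.
        exceed = sum(1 for curv in null if curv[j] >= c)
        # The +1 terms keep the estimate strictly positive.
        p_values.append((1 + exceed) / (1 + n_null))
    return p_values
```

Because the functions take only heterogeneity sequences as input, they can be paired with any clustering routine that reports such a value for each cluster count, which mirrors the algorithm-agnostic framing in the abstract. With B null sequences the smallest attainable p-value is 1/(B + 1), so a few hundred or more null replicates would typically be needed before testing at conventional significance levels.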