Unsupervised Machine Learning for Detecting Structural Anomalies in European Regional Statistics

2026-05-04 · Machine Learning

AI summary

The authors address the challenge of spotting unusual combinations of socio-economic indicators across European regions using Eurostat data. They apply five anomaly detection methods to four indicators (GDP per capita, unemployment rate, tertiary educational attainment, and population density) and label a region as a structural anomaly if at least three of the five methods flag it. The approach highlights regions whose profiles differ structurally from the EU-wide pattern, such as wealthy metropolitan areas and less developed regions, without implying data errors. It can help policymakers understand and respond to regional differences. The authors emphasize that the framework is reproducible, scalable, and compatible with existing validation workflows.

anomaly detection, Eurostat, NUTS2 regions, GDP per capita, unemployment rate, tertiary education, population density, machine learning, Isolation Forest, Local Outlier Factor
Authors
Bogdan Oancea
Abstract
Ensuring the coherence of regional socio-economic statistics is a central task for national statistical institutes. Traditional validation tools, such as range edits, ratio checks, or univariate outlier detection, are effective for identifying extreme values in individual series but are less suited for detecting unusual combinations of indicators in high-dimensional settings. This paper proposes an unsupervised machine learning framework for identifying structurally atypical regional profiles within Europe using publicly available Eurostat data. We construct a cross-sectional dataset of NUTS2 regions (2022) covering four key indicators: GDP per capita in PPS, unemployment rate, tertiary educational attainment, and population density. We apply and compare five anomaly detection techniques (univariate z-scores, Mahalanobis distance, Isolation Forest, Local Outlier Factor, and One-Class SVM) and classify a region as a structural anomaly if it is flagged by at least three of the five methods. The findings show that machine learning methods identify a consistent set of regions whose multivariate profiles diverge substantially from the EU-wide pattern. These include both highly developed metropolitan economies (Brussels, Vienna, Berlin, Prague) and regions with persistent socio-economic disadvantages (Central and Western Slovakia, Northern Hungary, Castilla-La Mancha, Extremadura), as well as Istanbul, whose profile differs markedly from EU capital regions. Importantly, these anomalies do not necessarily signal data quality issues; rather, they reflect meaningful structural divergence that warrants analytical or policy attention. The proposed framework is fully reproducible, scalable, and compatible with existing validation workflows, offering a flexible tool for early detection of unusual regional configurations within the European Statistical System.
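
The consensus rule described in the abstract (flag a region when at least three of the five detectors agree) can be illustrated with a minimal Python sketch using scikit-learn and SciPy. This is not the author's code. The function name `consensus_anomalies`, the z-score cutoff of 3, the 0.975 chi-square quantile for the Mahalanobis distance, the 5% contamination rate, and `n_neighbors=20` are all assumptions chosen for illustration, since the abstract does not state the thresholds used. The input is synthetic data standing in for the four Eurostat indicators, not real regional statistics.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EmpiricalCovariance
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def consensus_anomalies(X, contamination=0.05, z_cut=3.0, min_votes=3, seed=0):
    """Return (votes, is_anomaly) for an (n_regions, n_indicators) array.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    Z = StandardScaler().fit_transform(X)

    # 1. Univariate z-scores: flag if any single indicator is extreme.
    flag_z = (np.abs(Z) > z_cut).any(axis=1)

    # 2. Mahalanobis distance: compare squared distances with a chi-square quantile.
    d2 = EmpiricalCovariance().fit(Z).mahalanobis(Z)  # squared distances
    flag_maha = d2 > chi2.ppf(0.975, df=Z.shape[1])

    # 3. Isolation Forest: fit_predict returns -1 for outliers.
    iso = IsolationForest(contamination=contamination, random_state=seed)
    flag_if = iso.fit_predict(Z) == -1

    # 4. Local Outlier Factor: density relative to the 20 nearest neighbours.
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    flag_lof = lof.fit_predict(Z) == -1

    # 5. One-Class SVM with an RBF kernel; nu bounds the share of margin errors.
    svm = OneClassSVM(kernel="rbf", gamma="scale", nu=contamination)
    flag_svm = svm.fit(Z).predict(Z) == -1

    # Count how many of the five detectors flag each region.
    flags = np.column_stack([flag_z, flag_maha, flag_if, flag_lof, flag_svm])
    votes = flags.sum(axis=1)
    return votes, votes >= min_votes


# Synthetic stand-in for four indicators across 240 hypothetical regions.
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 4))
X[0] += 8.0                        # far above the bulk on every indicator
X[1] -= 8.0                        # far below the bulk on every indicator
X[2] += [8.0, -8.0, 8.0, -8.0]     # mixed, structurally unusual profile

votes, is_anomaly = consensus_anomalies(X)
print("regions flagged by >= 3 methods:", np.flatnonzero(is_anomaly))
```

Requiring agreement among several detectors trades some sensitivity for robustness: a region flagged by only one method, for example because of one method's particular density or kernel assumptions, is not labelled a structural anomaly.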