When Does Synthetic Data Augmentation Improve Score-Based Imbalanced Classification?

2026-06-24

Machine Learning
AI summary

The authors study how adding synthetic minority-class data to correct class imbalance affects score-based classifiers, as measured by ranking and threshold metrics such as AUROC and AUPRC. If the model is correctly specified, synthetic data cannot improve the best achievable population-level performance; it can only reduce finite-sample variance, and it may introduce bias when the synthetic data do not match the true minority distribution. If the model is misspecified, synthetic data can help by changing the effective class balance and correcting ranking errors. The authors provide mathematical bounds on these effects and validate them with simulations, which show limited gains when the model is correctly specified and larger but nonmonotone improvements when it is misspecified.

Synthetic data augmentation, Class imbalance, AUROC, AUPRC, Likelihood-ratio ordering, Score-based classification, Model misspecification, Finite-sample variance, Metric-regret, Threshold optimization
Authors
Zhengchi Ma, Pengfei Lyu, Anru R. Zhang
Abstract
Synthetic data augmentation is widely used to mitigate class imbalance, but its theoretical effects on score-based classification remain poorly understood. This paper develops a framework for characterizing when synthetic minority augmentation can improve threshold-integrated and threshold-optimized metrics, including AUROC, AUPRC, best-threshold balanced accuracy, and best-threshold \(F_1\) score. We separate the effect of augmentation into two components: a change in effective class weighting and a discrepancy between the synthetic and true minority distributions. Under well-specified score models, the raw estimator already targets the likelihood-ratio ordering, which is population-optimal for the metrics considered. Consequently, augmentation cannot provide a fundamental population-level improvement beyond possible finite-sample variance reduction, and may introduce additional bias through synthetic distributional error. We further establish minimax lower bounds showing that the raw estimator already achieves the optimal metric-regret rate in the well-specified regime. Under misspecification, however, augmentation can play a qualitatively different role: by changing the effective class balance, it can alter the restricted-class projection and correct ranking errors induced by the raw imbalanced objective. We provide explicit improvement bounds quantifying the roles of approximation error, finite-sample estimation error, and synthetic distributional error. Simulation studies corroborate the theory, demonstrating limited gains under well-specification and nontrivial but nonmonotone improvements under misspecification.
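
The well-specified claim can be illustrated with a small sketch. It is not the paper's simulation design: the two unit-variance Gaussian classes, the sample sizes, and the helper names `auroc` and `log_odds` are all assumptions chosen for illustration. When the score is the true posterior log-odds, changing the effective minority prior (the class-weighting effect of augmentation with a perfect synthetic distribution) only adds a constant to every score, so the likelihood-ratio ordering, and with it the AUROC, is unchanged.

```python
import math
import random


def auroc(pos_scores, neg_scores):
    """Empirical AUROC as the Mann-Whitney statistic, counting ties as 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def log_odds(x, prior, mu_pos=1.0, mu_neg=0.0):
    """Posterior log-odds for N(mu_pos, 1) vs N(mu_neg, 1) (assumed toy model).

    `prior` is the effective minority-class probability; it enters only
    through an additive constant, so it cannot change the ranking.
    """
    log_lr = (mu_pos - mu_neg) * x - 0.5 * (mu_pos ** 2 - mu_neg ** 2)
    return log_lr + math.log(prior / (1.0 - prior))


rng = random.Random(0)
x_pos = [rng.gauss(1.0, 1.0) for _ in range(100)]    # minority class
x_neg = [rng.gauss(0.0, 1.0) for _ in range(1900)]   # majority class

# Raw imbalanced prior (5% minority) vs. a rebalanced effective prior (50%).
auc_raw = auroc([log_odds(x, 0.05) for x in x_pos],
                [log_odds(x, 0.05) for x in x_neg])
auc_balanced = auroc([log_odds(x, 0.5) for x in x_pos],
                     [log_odds(x, 0.5) for x in x_neg])

print(f"AUROC, raw prior:        {auc_raw:.4f}")
print(f"AUROC, rebalanced prior: {auc_balanced:.4f}")
```

The sketch covers only the population-level ordering argument. It does not model finite-sample estimation of the score, a synthetic distribution that differs from the true minority distribution, or a misspecified model class, which is where the abstract says augmentation can change the ranking.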