Learning What Not to Impute: An Uncertainty-Aware Diffusion Framework for Meaningful Missingness
2026-06-03 • Machine Learning
AI summary
The authors observe that in many datasets some missing values are meaningfully missing (the value is intrinsically absent and the gap is itself valid information), while others are missing because of the observation process and should be filled in. They formalize this as selective imputation: deciding which missing entries to preserve as missing and which to impute. Their proposed method, Diff-Joint, is a diffusion-based framework that jointly models the tabular data and a latent missingness mask, so it imputes values and labels meaningfully missing entries at the same time. Experiments on synthetic and real-world datasets show that Diff-Joint identifies meaningfully missing entries while achieving competitive imputation accuracy and improved downstream task performance.
missing value imputation, selective imputation, diffusion models, tabular data, latent mask, conditional sampling, uncertainty-aware aggregation, data imputation, machine learning
Authors
Lixing Zhang, Yidong Ouyang, Weifu Li, Shixiang Zhu, Guang Cheng, Liyan Xie
Abstract
Missing value imputation is a fundamental task in machine learning, with most existing methods assuming that all missing entries correspond to unobserved regular values. In many real-world datasets, however, missingness may arise from two distinct sources: some entries are meaningfully missing (intrinsically absent and semantically valid), while others are missing due to the observation process and should be imputed. We formalize this distinction as a selective imputation problem, where the goal is to jointly infer which missing entries should be preserved and which should be recovered. To address this challenge, we propose Diff-Joint, a diffusion-based framework that jointly models tabular data together with a latent missingness mask. The method alternates between conditional sampling and uncertainty-aware aggregation to iteratively refine both imputed values and missingness labels. Empirical results on synthetic and real-world datasets demonstrate that Diff-Joint effectively identifies meaningfully missing entries while achieving competitive imputation accuracy and improved downstream task performance.
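The alternation described in the abstract can be illustrated with a toy loop. The following is a minimal sketch, not the authors' algorithm: the diffusion model is replaced by a hypothetical stand-in sampler (`toy_sampler`) that draws values from a per-column Gaussian and mask votes from a supplied probability matrix `p_keep`, and the aggregation rule (a vote-fraction threshold `tau`) is an assumption, since the abstract does not specify it. All function and parameter names are illustrative.

```python
import numpy as np

# Label codes for each cell (illustrative, not from the paper).
OBSERVED, UNDECIDED, KEEP, IMPUTE = 0, 1, 2, 3


def toy_sampler(x, p_keep, n_draws, rng):
    """Hypothetical stand-in for a conditional diffusion sampler.

    Returns value draws and mask draws, each of shape (n_draws, n, d).
    Values come from a per-column Gaussian fit to the currently filled
    entries; mask draws are Bernoulli(p_keep), True = "meaningfully missing".
    Assumes every column has at least one non-NaN entry.
    """
    mu = np.nanmean(x, axis=0)
    sd = np.nanstd(x, axis=0) + 1e-8
    shape = (n_draws,) + x.shape
    values = rng.normal(mu, sd, size=shape)
    keep = rng.random(shape) < p_keep
    return values, keep


def selective_impute(x, p_keep, n_rounds=3, n_draws=50, tau=0.8, seed=0):
    """Alternate sampling and thresholded aggregation on a NaN-coded array."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float).copy()
    labels = np.full(x.shape, OBSERVED)
    labels[np.isnan(x)] = UNDECIDED

    for _ in range(n_rounds):
        undecided = labels == UNDECIDED
        if not undecided.any():
            break

        # Step 1: draw joint (value, mask) samples given the current table.
        values, keep = toy_sampler(x, p_keep, n_draws, rng)

        # Step 2: aggregate; commit only entries with confident votes.
        keep_frac = keep.mean(axis=0)
        commit_keep = undecided & (keep_frac >= tau)
        commit_fill = undecided & (keep_frac <= 1.0 - tau)

        # Imputed value = mean over the draws that voted "impute".
        votes = (~keep).sum(axis=0)
        fill = np.where(~keep, values, 0.0).sum(axis=0) / np.maximum(votes, 1)

        x[commit_fill] = fill[commit_fill]
        labels[commit_keep] = KEEP
        labels[commit_fill] = IMPUTE

    return x, labels


if __name__ == "__main__":
    table = np.array([[1.0, np.nan], [2.0, 5.0], [np.nan, 7.0], [3.0, np.nan]])
    prior = np.array([[0.0, 0.9], [0.0, 0.0], [0.1, 0.0], [0.0, 0.5]])
    filled, cell_labels = selective_impute(table, prior)
    print(filled)
    print(cell_labels)
```

Entries whose vote fraction falls strictly between `1 - tau` and `tau` stay undecided and are revisited in the next round, after committed imputations have updated the column statistics that the stand-in sampler conditions on. Cells labeled `KEEP` remain `NaN` in the output, which is how this sketch preserves meaningfully missing entries.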