Mean Estimation from Coarse Data: Characterizations and Efficient Algorithms

2026-02-26Machine Learning

Machine LearningData Structures and Algorithms
AI summary

The authors study how to estimate the average of a group of values when you only see which group or set each value belongs to, instead of the exact number. These groups come from dividing the space into convex parts, and the original data follow a multidimensional bell curve (Gaussian). They build on previous work that showed it is possible to find the average when the groups are convex and the average is unique, but hard otherwise. The authors answer two key questions: when can the average be uniquely found, and can it be done efficiently? They provide solutions to both, clarifying when and how mean estimation works with such partial data.

coarse dataGaussian mean estimationconvex partitionsidentifiabilitysample efficiencyNP-hardnesshigh-dimensional statisticsset-valued observationscomputational efficiency
Authors
Alkis Kalavasis, Anay Mehrotra, Manolis Zampetakis, Felix Zhou, Ziyu Zhu
Abstract
Coarse data arise when learners observe only partial information about samples; namely, a set containing the sample rather than its exact value. This occurs naturally through measurement rounding, sensor limitations, and lag in economic systems. We study Gaussian mean estimation from coarse data, where each true sample $x$ is drawn from a $d$-dimensional Gaussian distribution with identity covariance, but is revealed only through the set of a partition containing $x$. When the coarse samples, roughly speaking, have ``low'' information, the mean cannot be uniquely recovered from observed samples (i.e., the problem is not identifiable). Recent work by Fotakis, Kalavasis, Kontonis, and Tzamos [FKKT21] established that sample-efficient mean estimation is possible when the unknown mean is identifiable and the partition consists of only convex sets. Moreover, they showed that without convexity, mean estimation becomes NP-hard. However, two fundamental questions remained open: (1) When is the mean identifiable under convex partitions? (2) Is computationally efficient estimation possible under identifiability and convex partitions? This work resolves both questions. [...]