Valid Inference with Synthetic Data via Task Exchangeability

2026-06-11

Artificial Intelligence, Machine Learning
AI summary

The authors examine how synthetic data, meaning data generated by computer models, can help scientists run more studies and speed up discovery. Such data can be biased, noisy, or misspecified. To address this, they introduce a condition called task exchangeability: the researcher can identify past tasks, for which real data is available, that are exchangeable with the current task in a precise mathematical sense. Under this condition, they develop inference methods with provable validity guarantees, along with extensions that hold beyond exchangeability. They demonstrate the methods on public opinion surveys with LLM-generated "silicon samples" and on AI evaluation with autoraters.

synthetic data, LLM, task exchangeability, valid inference, bias, pilot studies, proteomics, generative models, AI evaluation, silicon samples
Authors
Lezhi Tan, Tijana Zrnic
Abstract
There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.
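
To make the informal description of task exchangeability more concrete, here is a minimal sketch of one way exchangeability across tasks could be turned into an interval. It is not the authors' method, which the abstract does not specify; it swaps in a standard split-conformal construction as an illustration. The function name `conformal_interval`, its arguments, and the toy numbers are all hypothetical. The assumption is that each historical task supplies a (synthetic estimate, real estimate) pair, and that these pairs are exchangeable with the pair for the current task, whose real estimate is unobserved.

```python
import math


def conformal_interval(hist_synth, hist_real, current_synth, alpha=0.1):
    """Illustrative split-conformal interval (an assumption, not the paper's method).

    hist_synth, hist_real: per-task estimates from synthetic and real data
    on historical tasks. current_synth: synthetic estimate on the current task.
    If the (synthetic, real) pairs are exchangeable across the historical
    tasks and the current task, the returned interval contains the current
    task's real-data estimate with probability at least 1 - alpha.
    """
    if len(hist_synth) != len(hist_real):
        raise ValueError("need one synthetic and one real estimate per task")
    n = len(hist_synth)
    # Nonconformity score: how far the synthetic estimate was from the real one.
    scores = sorted(abs(s - r) for s, r in zip(hist_synth, hist_real))
    # Rank of the conformal quantile among the n historical scores.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few historical tasks to certify this coverage level.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    return (current_synth - q, current_synth + q)


# Toy usage with made-up survey percentages for 10 historical tasks.
synth = [50, 40, 65, 30, 55, 70, 45, 60, 35, 52]
real = [48, 45, 61, 33, 55, 62, 44, 66, 36, 49]
print(conformal_interval(synth, real, current_synth=58, alpha=0.2))  # (52, 64)
```

The width of the interval is driven entirely by how well synthetic estimates tracked real ones on past tasks, which is why the choice of historical tasks matters: if they are not exchangeable with the current task, the coverage guarantee of this sketch no longer holds.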