Data Enrichment for Symbolic Regression Using Diffusion Models

2026-05-31 · Machine Learning

AI summary

The authors address the challenge of discovering scientific equations from sparse and noisy data, a regime in which symbolic regression methods often become unreliable. They propose a method that combines a variational autoencoder, a conditional latent diffusion model, and a physics-informed residual corrector to fill in missing data in a way that respects the governing physics. Testing their approach on heat conduction, incompressible Navier-Stokes flow, and a moving-mass gravitational potential, they report improved equation recovery in sparse-data regimes. The method does this without requiring experts to supply additional domain-specific data, which makes equation discovery more practical.

symbolic regression, data enrichment, latent diffusion model, variational autoencoder, physics-informed modeling, partial differential equations, Navier-Stokes equations, governing equations, machine learning, scientific discovery
Authors
Simon De Reuver, Tamas Kristof Toth, Teddy Lazebnik
Abstract
Symbolic regression (SR) offers a route to scientific discovery by converting observations into interpretable governing equations. Despite this promise, its reliability degrades sharply when spatiotemporal measurements are sparse, noisy, or physically incomplete, as commonly occurs in practice. Data enrichment (DE) has been shown to mitigate this limitation, yet additional samples can mislead equation discovery unless they preserve the physical structure of the target system. Implementing DE in this way requires narrow domain expertise as well as technical fluency, which severely limits its practical usefulness. In this study, we introduce a physics-guided latent diffusion framework for DE that serves downstream SR models. The proposed framework combines a variational autoencoder, a conditional latent diffusion model, and a physics-informed residual corrector to complete sparse observations with synthetic fields constrained by governing relations. We evaluate the approach on heat conduction, incompressible Navier-Stokes flow, and a moving single-mass Newtonian gravitational potential, using GPLearn, DEAP, and PySR as downstream SR backends. Our results reveal that physics-corrected enrichment consistently improves recovery in sparse regimes across physical dynamics and SR models. These results show that generative enrichment can strengthen equation discovery without additional domain expertise.
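
To make the idea of a physics-informed residual corrector more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a one-dimensional heat equation u_t = alpha * u_xx, a forward-Euler finite-difference residual, and plain gradient descent on the squared residual with observed grid points held fixed. The noisy field stands in for the output of a generative model, and all names (`heat_residual`, `residual_gradient`, `correct_field`) and parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def heat_residual(u, dt, dx, alpha):
    """Forward-Euler residual of u_t = alpha * u_xx; u has shape (nt, nx)."""
    u_t = (u[1:, 1:-1] - u[:-1, 1:-1]) / dt
    u_xx = (u[:-1, 2:] - 2.0 * u[:-1, 1:-1] + u[:-1, :-2]) / dx**2
    return u_t - alpha * u_xx

def residual_gradient(u, dt, dx, alpha):
    """Gradient of 0.5 * sum(residual**2) with respect to every grid value."""
    r = heat_residual(u, dt, dx, alpha)
    c = alpha / dx**2
    g = np.zeros_like(u)
    g[1:, 1:-1] += r / dt                  # contribution through u[n+1, i]
    g[:-1, 1:-1] += r * (2.0 * c - 1.0 / dt)  # contribution through u[n, i]
    g[:-1, 2:] -= c * r                    # contribution through u[n, i+1]
    g[:-1, :-2] -= c * r                   # contribution through u[n, i-1]
    return g

def correct_field(u, observed_mask, dt, dx, alpha, steps=500):
    """Gradient descent on the residual; observed entries are never modified."""
    # Conservative step size from a bound on the residual operator's norm.
    step = 1.0 / (2.0 / dt + 4.0 * alpha / dx**2) ** 2
    free = ~observed_mask
    u = u.copy()
    for _ in range(steps):
        u[free] -= step * residual_gradient(u, dt, dx, alpha)[free]
    return u

# Toy setup (assumed values): an exact heat-equation solution on a small grid.
rng = np.random.default_rng(0)
alpha, nx, nt = 0.1, 21, 21
x = np.linspace(0.0, 1.0, nx)
t = np.linspace(0.0, 0.1, nt)
dx, dt = x[1] - x[0], t[1] - t[0]
truth = np.exp(-alpha * np.pi**2 * t)[:, None] * np.sin(np.pi * x)[None, :]

# Roughly 10% of points are "measured"; the rest mimic an imperfect synthetic field.
observed_mask = rng.random(truth.shape) < 0.1
noise = 0.05 * rng.standard_normal(truth.shape)
generated = np.where(observed_mask, truth, truth + noise)

corrected = correct_field(generated, observed_mask, dt, dx, alpha)
before = np.linalg.norm(heat_residual(generated, dt, dx, alpha))
after = np.linalg.norm(heat_residual(corrected, dt, dx, alpha))
print(f"residual norm before: {before:.3f}, after: {after:.3f}")
```

In a pipeline of the kind the abstract describes, a field corrected in this spirit would then be passed to an SR backend such as GPLearn, DEAP, or PySR. The fixed-mask design reflects the intuition that enrichment should fill gaps without overwriting real measurements.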