Noise Titration: Exact Distributional Benchmarking for Probabilistic Time Series Forecasting
2026-03-23 • Machine Learning
AI summary
The authors argue that current time series forecasting methods are tested in a way that cannot prove whether they work under changing conditions. They suggest adding controlled noise to simulated data where the generating rules are known, making it possible to measure forecasting accuracy exactly. To do this, they extend a model called Fern to better handle noise and relationships between variables. Their tests show that large, popular forecasting models struggle with regime changes and noise, while the extended Fern model stays accurate. This work proposes a more precise way to evaluate forecasting models.
time series forecasting · non-stationarity · Gaussian noise · chaotic systems · stochastic dynamics · negative log-likelihood · Fern architecture · Symmetric Positive Definite cone · joint covariance · distributional inference
Authors
Qilin Wang
Abstract
Modern time series forecasting is evaluated almost entirely through passive observation of single historical trajectories, rendering claims about a model's robustness to non-stationarity fundamentally unfalsifiable. We propose a paradigm shift toward interventionist, exact-statistical benchmarking. By systematically titrating calibrated Gaussian observation noise into known chaotic and stochastic dynamical systems, we transform forecasting from a black-box sequence matching game into an exact distributional inference task. Because the underlying data-generating process and noise variance are mathematically explicit, evaluation can rely on exact negative log-likelihoods and calibrated distributional tests rather than heuristic approximations. To fully leverage this framework, we extend the Fern architecture into a probabilistic generative model that natively parameterizes the Symmetric Positive Definite (SPD) cone, outputting calibrated joint covariance structures without the computational bottleneck of generic Jacobian modeling. Under this rigorous evaluation, we find that state-of-the-art zero-shot foundation models behave consistently with the context-parroting mechanism, failing systematically under non-stationary regime shifts and elevated noise. In contrast, Fern explicitly captures the invariant measure and multivariate geometry of the underlying dynamics, maintaining structural fidelity and statistically sharp calibration precisely where massive sequence-matching models collapse.
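The core evaluation idea in the abstract can be illustrated with a minimal sketch: simulate a known dynamical system, titrate Gaussian observation noise at a known variance, and score predictive densities with an exact negative log-likelihood. The Lorenz system, the Euler integrator, and the Cholesky covariance head below are illustrative assumptions, not the paper's actual benchmark suite or Fern's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One Euler step of the Lorenz system (illustrative integrator choice)."""
    x, y, z = state
    deriv = np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    return state + dt * deriv

# Simulate a clean trajectory from a fully known data-generating process.
T = 500
traj = np.empty((T, 3))
traj[0] = np.array([1.0, 1.0, 1.0])
for t in range(1, T):
    traj[t] = lorenz_step(traj[t - 1])

# "Titrate" calibrated Gaussian observation noise with a known variance.
noise_std = 0.5
obs = traj + rng.normal(scale=noise_std, size=traj.shape)

def gaussian_nll(y, mean, var):
    """Exact per-step NLL of y under an isotropic Gaussian N(mean, var*I)."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (y - mean) ** 2 / var, axis=-1)

# Because the process and noise variance are explicit, the oracle predictive
# density N(traj[t], noise_std^2 * I) yields an exact likelihood floor that
# any forecaster's predictive NLL can be compared against.
oracle_nll = gaussian_nll(obs, traj, noise_std ** 2).mean()

# One standard way to parameterize the SPD cone (a sketch, not Fern's actual
# head): predict a lower-triangular Cholesky factor with a positive diagonal,
# so that L @ L.T is always a valid joint covariance matrix.
raw = rng.normal(size=(3, 3))
L = np.tril(raw)
np.fill_diagonal(L, np.exp(np.diag(raw)))  # exp keeps the diagonal positive
cov = L @ L.T
```

The Cholesky trick is what lets a network output unconstrained reals while its implied covariance stays inside the SPD cone by construction, avoiding any projection or Jacobian machinery at evaluation time.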