Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

2026-04-20

Artificial Intelligence
AI summary

The authors introduce BLF, a system designed to predict yes/no questions better than existing methods. It uses a special way to combine numbers and written explanations that get updated step-by-step, unlike older methods that just pile up information. BLF also runs several prediction trials and smartly combines their results, plus it adjusts predictions to handle unusual cases well. Testing on many past questions shows BLF outperforms top forecasting models. The authors also created a careful testing setup to ensure their comparisons are reliable.

Bayesian belief state, natural-language evidence, iterative tool-use loop, hierarchical multi-trial aggregation, logit-space shrinkage, hierarchical calibration, Platt scaling, ForecastBench, backtesting, statistical methodology
Authors
Kevin Murphy
Abstract
We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. (2) Hierarchical multi-trial aggregation: running $K$ independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates. On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok 4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains. In addition, we develop a robust backtesting framework with a leakage rate below 1.5%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.
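To make ideas (2) and (3) concrete, here is a minimal sketch of logit-space shrinkage over $K$ trial probabilities followed by Platt scaling. The shrinkage weight, the fixed prior probability, and the Platt parameters `a` and `b` below are illustrative placeholders; the paper's data-dependent prior and hierarchical prior over the calibration parameters are not reproduced here.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def aggregate_trials(probs, prior_p=0.5, shrinkage=0.3):
    """Illustrative logit-space shrinkage aggregation:
    average the K trial probabilities in logit space, then
    shrink toward the logit of a prior probability.
    (The paper uses a data-dependent prior; prior_p and
    shrinkage here are placeholder constants.)"""
    mean_logit = sum(logit(p) for p in probs) / len(probs)
    z = (1 - shrinkage) * mean_logit + shrinkage * logit(prior_p)
    return sigmoid(z)

def platt_scale(p, a=1.0, b=0.0):
    """Standard Platt scaling applied to the logit:
    sigmoid(a * logit(p) + b). The hierarchical prior over
    (a, b) described in the abstract is omitted."""
    return sigmoid(a * logit(p) + b)

# Example: four independent trials, aggregated then calibrated.
trials = [0.82, 0.76, 0.90, 0.85]
aggregated = aggregate_trials(trials)
calibrated = platt_scale(aggregated, a=0.9, b=0.05)
```

Averaging in logit space rather than probability space keeps extreme but consistent trials extreme, while the shrinkage term pulls noisy disagreeing trials toward the prior.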