Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

2026-06-02 | Artificial Intelligence

AI summary

The authors created Hedge-Bench 1.0, a benchmark of 102 real, on-the-job tasks performed by professional hedge fund analysts, to better evaluate AI reasoning in finance. Unlike benchmarks that rely on model-judged outputs, it grades responses deterministically against verified expert reasoning steps, which avoids the noise and circularity of model-based judging. Current frontier models and agents score below 16%, showing that this remains a hard problem. The authors also release the dataset and evaluation harness openly for others to use.

hedge fund analyst, financial analysis, AI benchmarking, open-ended reasoning, deterministic grading, expert reasoning, evaluation harness, dataset
Authors
Eric Cho, Shawn Huang, Alice Lu, Andy Lyu
Abstract
AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.