The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

2026-06-16 • Artificial Intelligence

AI summary

The authors present the Stanford EDGAR Filings Dataset (SEFD), a publicly available collection of U.S. Securities and Exchange Commission (SEC) filings reconstructed as MultiMarkdown for training large language models. The dataset provides clean, long documents such as financial statements and risk disclosures, a kind of training data that is scarce. They also introduce two benchmarks derived from this data: EDGAR-Forecast, for forecasting financial figures from filings, and EDGAR-OCR, for transcribing complex financial tables. The authors report that SEFD is token-efficient and has less than 0.1% overlap with Common Crawl-derived corpora.

Large Language Models, SEC Filings, Financial Statements, Long-Context Data, Text Pretraining, Financial Forecasting, Document Understanding, OCR, Dataset Creation, Natural Language Processing

Authors

Nick Bettencourt, Xiaowei Ding, Kay Giesecke

Abstract

As high-quality public web corpora become increasingly exhausted, clean long-context documents have become a scarce and expensive source of training data for large language models (LLMs). Existing long-context corpora are often proprietary and costly to acquire, synthetically generated, or concentrated in narrow domains such as programming. We introduce the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown for financial language modeling and evaluation. SEFD makes audited financial statements, risk disclosures, ownership reports, accounting notes, and market-moving event filings usable as long-context pretraining data and as a basis for financial reasoning, forecasting, compliance, and document understanding. The resulting corpus is token-efficient, model-ready, and has less than 0.1% overlap with Common Crawl-derived corpora. We release SEFD-v1, a 152B-token initial public snapshot, and provide corpus-level analyses of a larger 18.5M-filing archive estimated at 550B tokens. We further introduce two SEFD-derived benchmarks: EDGAR-Forecast, which evaluates filing-grounded numerical forecasting after model knowledge cutoffs, and EDGAR-OCR, which evaluates transcription of complex financial tables.
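
To put the abstract's size figures in proportion, the short Python sketch below works out two derived quantities: the share of the estimated archive covered by SEFD-v1, and the mean token count per filing. The constants are the figures quoted above; the function names are illustrative and not part of any SEFD tooling, and the 550B total is itself an estimate, so both results are rough.

```python
# Figures quoted in the abstract. The archive token count is an estimate,
# so everything derived from it is approximate.
SNAPSHOT_TOKENS = 152_000_000_000   # SEFD-v1 public snapshot
ARCHIVE_TOKENS = 550_000_000_000    # estimated size of the full archive
ARCHIVE_FILINGS = 18_500_000        # filings in the full archive


def snapshot_share(snapshot_tokens: int, archive_tokens: int) -> float:
    """Fraction of the estimated archive tokens covered by the snapshot."""
    return snapshot_tokens / archive_tokens


def mean_tokens_per_filing(archive_tokens: int, archive_filings: int) -> float:
    """Average filing length in tokens (a mean; says nothing about the spread)."""
    return archive_tokens / archive_filings


share = snapshot_share(SNAPSHOT_TOKENS, ARCHIVE_TOKENS)
mean_len = mean_tokens_per_filing(ARCHIVE_TOKENS, ARCHIVE_FILINGS)

print(f"SEFD-v1 share of estimated archive: {share:.1%}")   # 27.6%
print(f"Mean tokens per filing: {mean_len:,.0f}")           # 29,730
```

A mean of roughly 30K tokens per filing is consistent with the long-context framing, though the abstract does not describe the distribution of filing lengths, so individual filings may be far shorter or longer.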
