Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

2026-04-27
Computation and Language

AI summary

The authors created ProHist-Bench, a new benchmark for testing how well large language models (LLMs) handle demanding historical research questions. Unlike earlier tests that check only basic facts or vocabulary, it targets deeper historical reasoning, drawing its questions from the Chinese Imperial Examination system, which spans more than 1,300 years of history. They evaluated 18 LLMs and found that even the strongest models struggle with these complex questions. The authors hope the benchmark will help improve how LLMs support historical research.

Large Language Models · Historical Reasoning · Chinese Imperial Examination · Keju System · Benchmarking · Evidentiary Reasoning · East Asian History · Interdisciplinary Research
Authors
Lirong Gao, Zeqing Wang, Yuyan Cai, Jiayi Deng, Yanmei Gu, Yiming Zhang, Jia Zhou, Yanfei Zhang, Junbo Zhao
Abstract
While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.
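To make the idea of fine-grained, rubric-based grading concrete, here is a minimal Python sketch of how a model answer might be scored against per-question rubrics. The `Rubric` schema, `score_answer`, and the `judge_fn` callback are illustrative assumptions for exposition, not the authors' released evaluation code; the actual scoring protocol in the ProHist-Bench repository may differ.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rubric:
    """One fine-grained criterion attached to a question (hypothetical schema)."""
    criterion: str   # e.g. "correctly dates the reform to the relevant dynasty"
    weight: float    # relative importance of this criterion

def score_answer(answer: str,
                 rubrics: List[Rubric],
                 judge_fn: Callable[[str, str], bool]) -> float:
    """Return a weighted score in [0, 1] for a model answer.

    judge_fn(answer, criterion) -> bool stands in for whatever check
    (LLM-as-judge or human annotator) decides whether the answer
    satisfies a given criterion.
    """
    total = sum(r.weight for r in rubrics)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r in rubrics if judge_fn(answer, r.criterion))
    return earned / total

# Example usage with a trivial keyword-matching "judge" as a stand-in:
rubrics = [
    Rubric("mentions the jinshi degree", 1.0),
    Rubric("attributes the policy to the correct dynasty", 2.0),
]
naive_judge = lambda ans, crit: crit.split()[-1] in ans.lower()
print(score_answer("The jinshi degree was central to Tang-era selection.",
                   rubrics, naive_judge))  # partial credit: 1.0 / 3.0
```

Weighted partial credit of this kind is what lets a benchmark with 10,891 rubrics distinguish answers that are broadly correct from those that recover every evidentiary detail, rather than collapsing each question to a single right/wrong judgment.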