Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents
2026-04-03 • Computation and Language
AI summary
The authors studied how often citation URLs generated by large language models and research agents are fake or don't work. They found that 3-13% of URLs were completely made up and 5-18% couldn’t be reached. Some models create many fake links, while others mostly find real ones that have broken over time. To help fix this, they made a tool called urlhealth that checks if URLs are valid or just outdated. When models use this tool, they greatly reduce bad citations, making their references more reliable.
large language models, citation URLs, hallucinated URLs, link rot, Wayback Machine, research agents, URL validation, self-correction, urlhealth tool
Authors
Delip Rao, Eric Wong, Chris Callison-Burch
Abstract
Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systematically measured. We address six research questions about citation URL validity using 10 models and agents on DRBench (53,090 URLs) and 3 models on ExpertQA (168,021 URLs across 32 academic fields). We find that 3–13% of citation URLs are hallucinated (they have no record in the Wayback Machine and likely never existed), while 5–18% are non-resolving overall. Deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates. Domain effects are pronounced: non-resolving rates range from 5.4% (Business) to 11.4% (Theology), with per-model effects even larger. Decomposing failures reveals that some models fabricate every non-resolving URL, while others show substantial link-rot fractions indicating genuine retrieval. As a solution, we release urlhealth, an open-source tool for URL liveness checking and stale-vs-hallucinated classification using the Wayback Machine. In agentic self-correction experiments, models equipped with urlhealth reduce non-resolving citation URLs by 6–79× to under 1%, though effectiveness depends on the model's tool-use competence. The tool and all data are publicly available. Our characterization findings, failure taxonomy, and open-source tooling establish that citation URL validity is both measurable at scale and correctable in practice.
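The stale-vs-hallucinated distinction the abstract describes can be sketched as two checks: does the URL resolve now, and does the Wayback Machine hold any snapshot of it? The sketch below is not the urlhealth implementation (its internals are not shown on this page); it is a minimal illustration using Python's standard library and the public Wayback Machine availability endpoint, with hypothetical function names.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Public Wayback Machine availability endpoint (returns JSON).
WAYBACK_API = "https://archive.org/wayback/available?url={}"


def resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL currently answers with a non-error HTTP status."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False


def has_wayback_record(url: str, timeout: float = 10.0) -> bool:
    """Return True if the Wayback Machine reports at least one snapshot."""
    query = WAYBACK_API.format(urllib.parse.quote(url, safe=""))
    try:
        with urllib.request.urlopen(query, timeout=timeout) as resp:
            data = json.load(resp)
        return bool(data.get("archived_snapshots"))
    except (urllib.error.URLError, OSError):
        return False


def classify(is_live: bool, is_archived: bool) -> str:
    """Map the two checks onto the paper's three-way outcome.

    A non-resolving URL with an archival record is 'stale' (link rot);
    one with no record anywhere is treated as likely 'hallucinated'.
    """
    if is_live:
        return "live"
    return "stale" if is_archived else "hallucinated"
```

In use, a citation checker would call `classify(resolves(url), has_wayback_record(url))` per URL; the actual tool likely adds retries, rate limiting, and GET fallbacks for servers that reject HEAD requests.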