Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

2026-05-27 · Information Retrieval

Information Retrieval, Artificial Intelligence
AI summary

The authors compare two ways that AI agents find usable data online: a Baseline Agent that searches the open web without structured metadata, and a Semantic Agent that searches a corpus of datasets described with semantic metadata such as schema.org. They find that the Semantic Agent is much more precise at returning machine-readable data, while the Baseline Agent answers more questions but often returns prose-heavy pages or portal landing pages instead of the data itself. The study suggests that structured metadata remains important for agents that need to reliably access and use data, especially for tasks that require exact information. Unstructured web search is better suited to broad exploration than to exact data retrieval.

Semantic metadata, schema.org, FAIR principles, Large Language Models, agentic data retrieval, machine-actionable data, precision, data accessibility, unstructured web, dataset registries
Authors
Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy
Abstract
In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets described with schema.org metadata. We deploy an "LLM-as-a-judge" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving 44.9% higher precision for metadata-rich registries and 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers "Last-Mile Utility" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.
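To make the abstract's two headline metrics concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that a "machine-readable download" can be approximated by a schema.org `Dataset` record whose `distribution` carries a `contentUrl` and a recognised `encodingFormat`, that precision is the fraction of returned results judged acceptable, and that coverage is the fraction of questions with at least one returned result. The function names, the format whitelist, and the example record are all hypothetical.

```python
import json

# Assumption: a small illustrative whitelist; the paper does not specify one.
MACHINE_READABLE = {"text/csv", "application/json", "application/x-parquet"}


def has_machine_readable_download(jsonld: str) -> bool:
    """Return True if a schema.org Dataset record lists a usable download."""
    record = json.loads(jsonld)
    if record.get("@type") != "Dataset":
        return False
    distributions = record.get("distribution", [])
    # schema.org allows a single object or a list of DataDownload objects.
    if isinstance(distributions, dict):
        distributions = [distributions]
    return any(
        isinstance(d, dict)
        and bool(d.get("contentUrl"))
        and d.get("encodingFormat") in MACHINE_READABLE
        for d in distributions
    )


def precision(judgements: list[bool]) -> float:
    """Fraction of returned results that a judge marked as acceptable."""
    return sum(judgements) / len(judgements) if judgements else 0.0


def coverage(results_per_question: list[list]) -> float:
    """Fraction of questions for which the agent returned any result."""
    if not results_per_question:
        return 0.0
    answered = sum(1 for results in results_per_question if results)
    return answered / len(results_per_question)


# Hypothetical record, for illustration only.
EXAMPLE = json.dumps({
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example rainfall measurements",
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/rainfall.csv",
    },
})

if __name__ == "__main__":
    print(has_machine_readable_download(EXAMPLE))
    print(precision([True, False, True, True]))
    print(coverage([["a"], [], ["b", "c"], []]))
```

Under these definitions an agent can score high on coverage and low on precision at the same time, which is the trade-off the abstract reports between the Baseline Agent and the Semantic Agent.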