Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web–Knowledge–Web Pipeline

2026-02-27

Machine Learning
AI summary

The authors developed a method called Web–Knowledge–Web (W→K→W) to find and organize information about small and medium-sized businesses in specialized industries, such as semiconductor equipment manufacturing. The method crawls websites for candidate supplier companies, builds a knowledge graph of these companies and their relationships, and then uses the graph to steer further crawling toward less-explored areas. They also created a way to estimate how complete the search is, based on techniques originally used to estimate the number of species in an ecosystem. In their experiments, this approach achieved higher precision and F1 than the other methods tested under the same page budget.

Small and Medium-sized Enterprises (SMEs), Supply-chain resilience, Knowledge graph, Entity extraction, Web crawling, Coverage estimation, Chao1 estimator, NAICS 333242, Recall and Precision, Species richness estimation
Authors
Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh
Abstract
Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet existing business databases suffer from substantial coverage gaps, particularly for sub-tier suppliers and firms in emerging niche markets. We propose a Web–Knowledge–Web (W→K→W) pipeline that iteratively (1) crawls domain-specific web sources to discover candidate supplier entities, (2) extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3) uses the knowledge graph's topology and coverage signals to guide subsequent crawling toward under-represented regions of the supplier space. To quantify discovery completeness, we introduce a coverage estimation framework inspired by ecological species-richness estimators (Chao1, ACE) adapted for web-entity populations. Experiments on the semiconductor equipment manufacturing sector (NAICS 333242) demonstrate that the W→K→W pipeline achieves the highest precision (0.138) and F1 (0.118) among all methods using the same 213-page crawl budget, building a knowledge graph of 765 entities and 586 relations while reaching peak recall by iteration 3 with only 112 pages.
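
To make the coverage idea concrete, the following is a minimal sketch of the standard bias-corrected Chao1 estimator applied to entity sightings. It is not the authors' implementation: the abstract does not say how Chao1 is adapted to web entities, so the sketch assumes that an entity's "abundance" is the number of times it was sighted during crawling. The function names `chao1` and `coverage` and the sample company names are hypothetical.

```python
from collections import Counter


def chao1(entity_counts):
    """Bias-corrected Chao1 estimate of the total number of distinct entities.

    entity_counts maps an entity name to the number of times it was sighted.
    Assumption: sighting counts play the role of species abundances.
    """
    counts = [c for c in entity_counts.values() if c > 0]
    s_obs = len(counts)                      # distinct entities observed
    f1 = sum(1 for c in counts if c == 1)    # entities seen exactly once
    f2 = sum(1 for c in counts if c == 2)    # entities seen exactly twice
    # Bias-corrected form; stays defined when f2 == 0.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def coverage(entity_counts):
    """Observed richness divided by estimated richness (0.0 if nothing seen)."""
    s_obs = sum(1 for c in entity_counts.values() if c > 0)
    estimate = chao1(entity_counts)
    return s_obs / estimate if estimate else 0.0


# Hypothetical sightings of supplier names across crawled pages.
sightings = ["acme", "acme", "beta", "gamma", "gamma", "delta", "epsilon", "acme"]
counts = Counter(sightings)

print(chao1(counts))     # estimated total number of suppliers
print(coverage(counts))  # fraction of the estimated population found so far
```

Many singletons relative to doubletons push the estimate well above the observed count, which signals that more crawling is likely to surface unseen suppliers. A crawler could use such a signal to decide where to spend its remaining page budget.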