LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment
2026-06-03 • Digital Libraries
Digital Libraries • Artificial Intelligence • Information Retrieval
AI summary
The authors created LCSHBench, a benchmark of 22,346 books in 15 languages cataloged with Library of Congress Subject Headings (LCSH), to help test automated subject cataloging systems. They included a record only when at least two independent cataloging agencies had assigned it LCSH. A companion study shows why: libraries usually agree on the overall topic of a work but often differ in the exact headings they choose. The benchmark therefore scores both exact heading matches and concept-level matches, using set and rank metrics. As a first demonstration, they fine-tuned a small embedding model, which improved cross-lingual heading retrieval and outperformed a larger hosted model on development-set exact recall@200. The gain varied by language and has not yet been confirmed on held-out test data.
Automated subject cataloging • Library of Congress Subject Headings (LCSH) • Benchmark dataset • Controlled vocabulary • Cross-lingual retrieval • Concept-level matching • Exact recall • Embedding models • Fine-tuning • Bibliographic records
Authors
Kwok Leong Tang
Abstract
Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, but LCSH has no standard public benchmark. We introduce LCSHBench: 22,346 books in 15 languages from the openly licensed Harvard, Columbia, and Princeton catalogs. Records enter only when at least two independent cataloging agencies assigned LCSH; we release per-catalog provenance plus union and unanimous answer views. A concordance study of 465,187 works cataloged by all three libraries shows why this design matters: libraries usually agree on the underlying topic (93.3% share a concept-level heading) but often differ in exact expression (only 39.4% have identical heading sets). LCSHBench therefore scores both exact and concept matches, with set and rank metrics broken down by language and heading type, across open-vocabulary generation and full-vocabulary retrieval. As a first demonstration, a low-rank fine-tune of a 300M on-device embedder improves cross-lingual retrieval and beats a 3,072-dimensional hosted embedder on development exact recall@200 (0.659 vs 0.623). The language panel shows the gain is not uniform, and held-out-test and end-to-end confirmation remain future work.
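To make the scoring ideas in the abstract concrete, the following is a minimal sketch of how union and unanimous answer views and an exact versus concept-level recall@k might be computed. It is not the benchmark's implementation. The function names (`answer_views`, `to_concept`, `recall_at_k`), the sample headings, and the catalog keys are all hypothetical. In particular, the abstract does not say how concept-level matching is defined; this sketch assumes, purely for illustration, that two headings share a concept when their main heading (the text before the first `--` subdivision) is the same.

```python
from typing import Dict, List, Set


def answer_views(per_catalog: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Build the union and unanimous answer views from per-catalog heading sets."""
    sets = list(per_catalog.values())
    return {
        # Union: every heading assigned by at least one catalog.
        "union": set().union(*sets),
        # Unanimous: only headings assigned by every catalog.
        "unanimous": set.intersection(*sets) if sets else set(),
    }


def to_concept(heading: str) -> str:
    # ASSUMPTION (not from the paper): approximate the "concept" by the
    # main heading, i.e. the text before the first "--" subdivision.
    return heading.split("--")[0].strip().lower()


def recall_at_k(ranked: List[str], gold: Set[str], k: int,
                concept: bool = False) -> float:
    """Fraction of gold headings (or gold concepts) found in the top-k list."""
    if not gold:
        return 0.0
    top = ranked[:k]
    if concept:
        top_keys = {to_concept(h) for h in top}
        gold_keys = {to_concept(h) for h in gold}
    else:
        top_keys = set(top)
        gold_keys = set(gold)
    return len(top_keys & gold_keys) / len(gold_keys)


# Hypothetical record cataloged by two libraries with differing subdivisions.
per_catalog = {
    "catalog_a": {"Rice--China--History", "Agriculture--China"},
    "catalog_b": {"Rice--China", "Agriculture--China"},
}
views = answer_views(per_catalog)

# Hypothetical ranked output of a retrieval system.
ranked = ["Agriculture--China", "Rice--China", "Tea--China"]

exact = recall_at_k(ranked, views["union"], k=2)
concept = recall_at_k(ranked, views["union"], k=2, concept=True)
print(f"exact recall@2 = {exact:.3f}, concept recall@2 = {concept:.3f}")
```

In this toy record the two catalogs agree on the topic but not on the exact subdivision string, so exact recall against the union view is lower than concept recall. That is the kind of gap the concordance figures in the abstract (93.3% versus 39.4%) describe.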