Which Models Are Our Models Built On? Auditing Invisible Dependencies in Modern LLMs

2026-06-10 • Computation and Language


AI summary

The authors explain that modern large language models (LLMs) often rely on other models and data in complicated ways, making it hard to track all these connections. They created ModSleuth, a tool that maps out these dependencies clearly using public information. Their work highlights challenges like identifying what counts as a dependency and matching different references to the same resources. By applying ModSleuth to several LLMs, the authors uncovered hidden links, licensing issues, and documentation problems. They provide both the tool and the maps to help people better understand how LLMs are built.

large language models, dependencies, artifact, data pipeline, recursive relationships, model training, license obligations, documentation inconsistencies, ModSleuth, dependency graph

Authors

Sanjay Adhikesaven, Haoxiang Sun, Sewon Min

Abstract

Modern LLM training pipelines increasingly rely on other models to generate data, filter corpora, judge outputs, and guide development decisions. These dependencies are recursive: a model may depend on an upstream artifact whose own dependencies are documented only in separate releases and artifacts. As a result, the full dependency structure is fragmented across heterogeneous public artifacts, with complexity and recursive depth far outpacing humans' ability to trace. We introduce ModSleuth, an agentic system that recursively reconstructs LLM dependency graphs from public artifacts with source-grounded evidence. We find that the primary challenge is no longer information extraction, but defining what constitutes a dependency and reconciling artifact references across inconsistent documentation. We address these challenges through a formalization that distinguishes direct and indirect dependencies, represents heterogeneous pipeline roles through operation-centered relationships, and resolves artifact identities across names, versions, and repositories. Applying ModSleuth to four public-artifact-rich LLM releases, we recover 1,060 source-verified dependencies and construct large-scale dependency graphs of modern LLM development. These graphs reveal multi-hop license obligations, train-evaluation coupling, discrepancies between released and training-time artifacts, and documentation inconsistencies that would otherwise be difficult to uncover. We release ModSleuth and the resulting dependency graphs to support transparent analysis of the increasingly complex ecosystems underlying modern LLMs.
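The formalization described in the abstract can be illustrated with a small sketch. The code below is not ModSleuth's implementation; it is a minimal example, under stated assumptions, of how operation-centered edges, alias-based identity resolution, and the direct/indirect distinction might fit together. The artifact names, operation labels, alias table, and the helpers `canonical`, `build_graph`, and `dependencies` are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: str     # artifact that has the dependency
    operation: str  # pipeline role (hypothetical labels, e.g. "generated_by")
    target: str     # upstream artifact being depended on


def canonical(name, aliases):
    """Map a raw reference to one identity (assumed: a simple alias table)."""
    key = name.strip().lower()
    return aliases.get(key, key)


def build_graph(edges, aliases):
    """Build {artifact: {(operation, upstream_artifact), ...}}."""
    graph = {}
    for e in edges:
        src = canonical(e.source, aliases)
        tgt = canonical(e.target, aliases)
        graph.setdefault(src, set()).add((e.operation, tgt))
        graph.setdefault(tgt, set())
    return graph


def dependencies(graph, root):
    """Breadth-first walk; returns {artifact: hops}. 1 hop = direct, more = indirect."""
    hops = {}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        for _, tgt in graph.get(node, ()):
            if tgt not in seen:
                seen.add(tgt)
                hops[tgt] = depth + 1
                queue.append((tgt, depth + 1))
    return hops


# Hypothetical artifacts; "Dataset-X" and "dataset-x v1" refer to the same resource.
edges = [
    Edge("model-a", "trained_on", "Dataset-X"),
    Edge("dataset-x", "generated_by", "model-b"),
    Edge("dataset-x v1", "filtered_by", "classifier-c"),
    Edge("model-b", "trained_on", "dataset-y"),
]
aliases = {"dataset-x v1": "dataset-x"}

graph = build_graph(edges, aliases)
hops = dependencies(graph, "model-a")
direct = sorted(a for a, h in hops.items() if h == 1)
indirect = sorted(a for a, h in hops.items() if h > 1)
print(direct)    # ['dataset-x']
print(indirect)  # ['classifier-c', 'dataset-y', 'model-b']
```

In this toy graph, `dataset-y` sits three hops upstream of `model-a`, so any obligation attached to it (a license term, for instance) would only surface through a recursive walk of this kind rather than from `model-a`'s own documentation.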
