ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
2026-06-16 • Computation and Language
Computation and Language • Artificial Intelligence • Machine Learning
AI summary
The authors created ReproRepo, a framework for measuring how well AI agents can find the problems that block people from reproducing machine learning research. Instead of curating and checking everything by hand, they used real issues that people reported on GitHub as evidence of what goes wrong. They tested four model-agent configurations on 1,149 recent papers and showed that these agents can spot many real reproducibility problems without running any code. The framework makes it easier to evaluate AI agents on reproducibility auditing at scale.
Reproducibility • Large Language Models (LLM) • GitHub Issues • Machine Learning • Code Execution • Reproduction Blockers • Scientific Progress • Benchmarking • Semantic Localization • ReproRepo
Authors
Shanda Li, Qiuhong Anna Wei, Jingwu Tang, Valerie Chen, Nihar B Shah, Tim Dettmers, Yiming Yang, Ameet Talwalkar
Abstract
Reproducing research results from papers and released code is central to scientific progress. Existing work has introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but these benchmarks are difficult to scale because they rely on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective at surfacing visible failures and identifying the right semantic region, but may still fall short at exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.
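To make the headline number concrete, here is a minimal sketch of a paper-level hit rate of the kind the abstract reports: the fraction of papers for which an agent surfaces at least one human-reported blocker. This is an illustration, not the ReproRepo implementation. The function name, the data layout, and the paper and issue identifiers are all hypothetical, and the step that judges whether an agent finding is semantically related to a GitHub issue is assumed to have already run.

```python
from typing import Dict, Set


def paper_level_hit_rate(
    human_issues: Dict[str, Set[str]],
    matched: Dict[str, Set[str]],
) -> float:
    """Fraction of papers with at least one matched human-reported issue.

    human_issues: paper id -> ids of human-reported GitHub issues (hypothetical layout).
    matched: paper id -> ids of issues that some agent finding was judged
             semantically related to (the judging step is assumed, not shown).
    """
    # Only papers that have at least one human-reported issue can count.
    papers = [p for p, issues in human_issues.items() if issues]
    if not papers:
        return 0.0
    hits = sum(1 for p in papers if human_issues[p] & matched.get(p, set()))
    return hits / len(papers)


# Toy data, invented for illustration only.
human_issues = {
    "paper_a": {"issue_1", "issue_2"},
    "paper_b": {"issue_3"},
    "paper_c": {"issue_4", "issue_5"},
    "paper_d": {"issue_6"},
}
matched = {
    "paper_a": {"issue_2"},
    "paper_b": set(),
    "paper_c": {"issue_4", "issue_5"},
    "paper_d": {"issue_6"},
}

rate = paper_level_hit_rate(human_issues, matched)
print(f"{rate:.2f}")  # 3 of 4 papers have a match -> 0.75
```

A paper counts as a hit as soon as one issue is matched, so this metric says nothing about how many of a paper's issues were found or how precisely they were localized.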