BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

2026-04-17

Computation and Language · Artificial Intelligence
AI summary

The authors created BAGEL, a new test of how well language models know about animals without looking up information during testing. They gathered questions from scientific papers and trusted sources about different animal facts, such as where animals live, how they behave, and how they sound. BAGEL helps identify which areas of animal knowledge models handle well and where they struggle. This helps train better models for tasks related to biodiversity and animal science.

large language models, closed-book evaluation, animal taxonomy, morphology, bioRxiv, species behavior, vocalization, biodiversity, fine-grained analysis
Authors
Jiacheng Shen, Masato Hagiwara, Milad Alizadeh, Ellen Gilsenan-McMahon, Marius Miron, David Robinson, Emmanuel Chemla, Sara Keen, Gagan Narula, Mathieu Laurière, Matthieu Geist, Olivier Pietquin
Abstract
Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well they handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. By focusing on closed-book evaluation, BAGEL measures a model's animal-related knowledge without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.
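The closed-book protocol described in the abstract can be sketched as a simple evaluation loop: the model sees only the question (no retrieved context), and answers are scored per knowledge category. This is a minimal illustrative sketch, not BAGEL's actual harness; the data format, the `toy_model` function, and the lenient exact-match scoring are all assumptions.

```python
def normalize(text: str) -> str:
    """Lowercase and drop punctuation for a lenient exact-match comparison."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.strip()

def evaluate_closed_book(model_answer_fn, qa_pairs):
    """Score a model closed-book: only the question string is shown.

    Returns overall accuracy and per-category (correct, total) counts,
    mirroring the fine-grained analysis the benchmark supports.
    """
    correct = 0
    per_category = {}
    for item in qa_pairs:
        pred = model_answer_fn(item["question"])  # no external retrieval passed in
        hit = normalize(pred) == normalize(item["answer"])
        correct += hit
        cat = item.get("category", "unknown")
        c, n = per_category.get(cat, (0, 0))
        per_category[cat] = (c + hit, n + 1)
    return correct / len(qa_pairs), per_category

# Toy data and a toy "model" that knows exactly one fact (hypothetical).
qa = [
    {"question": "What class of animals does the humpback whale belong to?",
     "answer": "Mammalia", "category": "taxonomy"},
    {"question": "On which continent are wild emus found?",
     "answer": "Australia", "category": "geographic distribution"},
]
toy_model = lambda q: "Mammalia" if "whale" in q else "I don't know"

accuracy, by_category = evaluate_closed_book(toy_model, qa)
print(accuracy)      # 0.5
print(by_category)   # {'taxonomy': (1, 1), 'geographic distribution': (0, 1)}
```

Keeping per-category tallies alongside the overall score is what enables the kind of breakdown by knowledge category (taxonomy, vocalization, distribution, and so on) that the abstract highlights.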