ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

2026-06-09, Artificial Intelligence

Artificial Intelligence, Computers and Society
AI summary

The authors created a benchmark called ABC-Bench to measure how well AI language model agents perform biology tasks that usually require expert humans. The tasks cover writing code for liquid handling robots, designing DNA fragments for assembly, and evading DNA synthesis screening. Every tested agent outperformed the median human expert baseline on all three tasks. Agents were strongest on tasks that draw on published knowledge and well-documented protocols, and weaker on a task requiring novel bioinformatics reasoning. In wet-lab validation, scripts written by one model (OpenAI's o4-mini-high) ran on an OpenTrons liquid handling robot and assembled DNA with the expected sequences.

Large Language Models, Biosecurity, Liquid Handling Robots, DNA Assembly, In silico Biology, Bioinformatics, DNA Synthesis Screening, OpenTrons, Agentic AI, Wet-lab Validation
Authors
Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman, Harmon Bhasin, Seth Donoughe
Abstract
Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed strongly on tasks drawing on published knowledge and well-documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.