Measuring Mid-2025 LLM-Assistance on Novice Performance in Biology

2026-02-18

Computers and Society, Artificial Intelligence
AI summary

The authors tested whether large language models (LLMs) could help beginners perform biology laboratory tasks, particularly tasks related to viral genetics. In a study with 153 participants, LLM assistance did not significantly improve the chance of completing the full workflow compared to using the internet. However, participants using LLMs showed a small improvement on some steps, such as cell culture. The study highlights that while LLMs perform well on computer-based tests, their real-world laboratory impact remains limited and warrants further research.

large language models, viral reverse genetics, cell culture, randomized controlled trial, biosecurity, benchmarks, laboratory workflow, human performance, Bayesian modeling, ordinal regression
Authors
Shen Zhou Hong, Alex Kleinman, Alyssa Mathiowetz, Adam Howes, Julian Cohen, Suveer Ganta, Alex Letizia, Dora Liao, Deepika Pahari, Xavier Roberts-Gaal, Luca Righetti, Joe Torres
Abstract
Large language models (LLMs) perform strongly on biological benchmarks, raising concerns that they may help novice actors acquire dual-use laboratory skills. Yet whether this translates to improved human performance in the physical laboratory remains unclear. To address this, we conducted a pre-registered, investigator-blinded, randomized controlled trial (June-August 2025; n = 153) evaluating whether LLMs improve novice performance in tasks that collectively model a viral reverse genetics workflow. We observed no significant difference in the primary endpoint of workflow completion (5.2% LLM vs. 6.6% Internet; P = 0.759), nor in the success rate of individual tasks. However, the LLM arm had numerically higher success rates in four of the five tasks, most notably for the cell culture task (68.8% LLM vs. 55.3% Internet; P = 0.059). Post-hoc Bayesian modeling of pooled data estimates an approximate 1.4-fold increase (95% CrI 0.74-2.62) in success for a "typical" reverse genetics task under LLM assistance. Ordinal regression modeling suggests that participants in the LLM arm were more likely to progress through intermediate steps across all tasks (posterior probability of a positive effect: 81%-96%). Overall, mid-2025 LLMs did not substantially increase novice completion of complex laboratory procedures but were associated with a modest performance benefit. These results reveal a gap between in silico benchmarks and real-world utility, underscoring the need for physical-world validation of AI biosecurity assessments as model capabilities and user proficiency evolve.
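To illustrate why a difference like 5.2% vs. 6.6% yields a large p-value at this sample size, the sketch below runs a two-sided Fisher exact test on illustrative counts (4/77 successes in the LLM arm vs. 5/76 in the Internet arm) that are merely consistent with the reported rates, not the paper's actual data; the paper does not state which test produced P = 0.759.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d   # arm sizes (row margins)
    col1 = a + c                # total successes (column margin)
    n = row1 + row2

    def hyperg(x):
        # P(top-left cell = x) under the null, with all margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = hyperg(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum probabilities of all tables at least as extreme as the observed one
    return sum(hyperg(x) for x in range(lo, hi + 1)
               if hyperg(x) <= p_obs * (1 + 1e-9))

# Illustrative counts: 4/77 (~5.2%) vs. 5/76 (~6.6%) task completions
p = fisher_exact_two_sided(4, 73, 5, 71)
print(f"two-sided p = {p:.3f}")
```

With only about nine completions across 153 participants, near-identical completion counts in the two arms leave the test far from significance, which is the abstract's point about the primary endpoint.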