Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design
2026-04-17 • Machine Learning
AI summary
The authors created a set of realistic tasks to test how well large language models (LLMs) can help with small-molecule drug design, such as predicting molecular properties and designing new molecules. They framed these tasks as reinforcement learning environments, allowing models to be trained and evaluated in a consistent way. Their results show that while frontier models perform well, there is still room for improvement, especially when little data is available. Importantly, they found that fine-tuning smaller models with reinforcement learning let them perform almost as well as the largest models, suggesting a practical way to improve LLMs for drug discovery.
Large Language Models, Drug Design, Molecular Property Prediction, Reinforcement Learning, Molecular Representation, Post-training, Benchmarking, Small Molecules, Molecular Design Tasks
Authors
Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir, Colin Grambow, John Bradshaw, Patricia Suriana, Chen Cheng, Kangway Chuang
Abstract
Large Language Models (LLMs) have the potential to accelerate small-molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach to evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in low-data experimental settings. Critically, we show that RL-based post-training can substantially improve performance. A smaller model post-trained on our environments becomes competitive with state-of-the-art frontier models, despite starting from a significantly weaker base. This suggests a practical route toward employing LLMs in drug discovery: by combining carefully designed evaluation tasks with targeted post-training, we can both elucidate and close critical capability gaps.
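To make the RL framing concrete, the sketch below shows one way a molecular property-prediction task could be wrapped as a single-step environment with a Gym-style reset/step interface. This is a minimal illustration under assumed conventions, not the authors' implementation: the class name, prompt format, and binary reward scheme are all hypothetical.

```python
# Hypothetical sketch: a molecular property-prediction task wrapped as a
# single-step RL environment. The agent (an LLM) receives a prompt built
# from a SMILES string, emits a textual prediction, and gets a scalar
# reward. All names and the reward scheme are illustrative.

class PropertyPredictionEnv:
    """One episode = one molecule: observe a prompt, answer, get reward."""

    def __init__(self, dataset):
        # dataset: list of (smiles, true_label) pairs, e.g. "yes"/"no"
        # solubility labels for a toy classification task.
        self.dataset = dataset
        self.idx = -1
        self.current = None

    def reset(self):
        # Advance to the next example and return the observation (a prompt).
        self.idx = (self.idx + 1) % len(self.dataset)
        self.current = self.dataset[self.idx]
        smiles, _ = self.current
        return f"Is the molecule {smiles} soluble? Answer 'yes' or 'no'."

    def step(self, action):
        # Reward 1.0 for a correct prediction, 0.0 otherwise; the episode
        # terminates after a single step.
        _, label = self.current
        reward = 1.0 if action.strip().lower() == label else 0.0
        return None, reward, True, {}


# Toy usage with a trivial "agent" that always answers 'yes'.
env = PropertyPredictionEnv([("CCO", "yes"), ("C1=CC=CC=C1", "no")])
obs = env.reset()
_, reward, done, _ = env.step("yes")   # correct on the first molecule
```

Because evaluation and post-training share this interface, the same environment can score a frozen model's accuracy or supply rewards to an RL fine-tuning loop.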