Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition
2026-02-18 • Machine Learning
AI summary
The authors developed a new method to create chemical analogs by learning the small chemical changes captured by matched molecular pairs (MMPs) from a large dataset. Their model can generate diverse new molecules based on an input molecule and lets users control the types of chemical changes made. They also designed a system, MMPT-RAG, that uses examples from external sources to guide the generation process for more relevant results. Tests show their approach produces more varied, novel, and realistic chemical analogs useful for drug discovery.
Keywords: matched molecular pairs, molecular analogs, machine learning, chemical transformations, foundation model, prompting, retrieval-augmented generation, drug discovery, diversity, novelty
Authors
Bo Pan, Peter Zhiping Zhang, Hao-Wei Pang, Alex Zhu, Xiang Yu, Liying Zhang, Liang Zhao
Abstract
Matched molecular pairs (MMPs) capture the local chemical edits that medicinal chemists routinely use to design analogs, but existing ML approaches either operate at the whole-molecule level with limited edit controllability or learn MMP-style edits from restricted settings and small models. We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations (MMPTs) to generate diverse variables conditioned on an input variable. To enable practical control, we develop prompting mechanisms that let users specify preferred transformation patterns during generation. We further introduce MMPT-RAG, a retrieval-augmented framework that uses external reference analogs as contextual guidance to steer generation and generalize from project-specific series. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability, and show that our method recovers realistic analog structures in practical discovery scenarios.
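As an illustration only, the retrieval-augmented idea described in the abstract can be sketched in a few lines of Python. This is not the authors' implementation: the `MMPT` record, the bigram-based `similarity` function, the `>>` and `|` prompt separators, and the example fragments are all hypothetical choices made for this sketch, and a real system would presumably use a chemistry-aware similarity and the model's own input format. The sketch shows one plausible shape of the pipeline: store reference transformations as (constant, variable before, variable after), retrieve those whose starting variable resembles the query variable, and place them ahead of the query as context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMPT:
    """Hypothetical record for one matched-molecular-pair transformation.

    constant      : the part shared by both molecules of the pair
    variable_from : the variable fragment before the edit
    variable_to   : the variable fragment after the edit
    Fragments are SMILES-like strings with [*:1] as the attachment point.
    """
    constant: str
    variable_from: str
    variable_to: str


def bigrams(s):
    """Set of overlapping two-character substrings of s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity(a, b):
    """Jaccard overlap of character bigrams.

    Assumption: a toy stand-in for a chemical similarity measure.
    """
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


def retrieve(query_variable, references, k=2):
    """Return the k references whose starting variable best matches the query."""
    ranked = sorted(
        references,
        key=lambda r: similarity(query_variable, r.variable_from),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_variable, retrieved):
    """Join retrieved edits as context, then the open-ended query edit."""
    context = [f"{r.variable_from}>>{r.variable_to}" for r in retrieved]
    return " | ".join(context + [f"{query_variable}>>"])


# Made-up reference edits on a phenyl core, for illustration only.
EXAMPLE_REFERENCES = [
    MMPT("[*:1]c1ccccc1", "[*:1]Cl", "[*:1]F"),
    MMPT("[*:1]c1ccccc1", "[*:1]OC", "[*:1]OCC"),
    MMPT("[*:1]c1ccccc1", "[*:1]C(=O)O", "[*:1]C(=O)N"),
]

if __name__ == "__main__":
    hits = retrieve("[*:1]Cl", EXAMPLE_REFERENCES, k=1)
    print(build_prompt("[*:1]Cl", hits))
```

In this sketch the generative model itself is left out; the prompt string is what a variable-to-variable model would be asked to complete, with the retrieved edits acting as the contextual guidance.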