Designing highly potent compounds using a chemical language model

Chen, Hengwei; Bajorath, Jürgen

doi:10.1038/s41598-023-34683-x

Cited by 6 publications

(16 citation statements)

References 25 publications

(30 reference statements)


“…This highlights the potential of LLMs in materials synthesis. Bajorath et al. have developed a generative chemical language model to predict highly potent compounds from less potent ones as input for drug discovery [21]. Priyakumar et al. enabled a transformer-decoder model named MolGPT inspired by GPT models to generate drug-like molecules [22].…”

Section: Introduction (mentioning)

confidence: 99%

Knowledge-informed generation of organic structure-directing agents for zeolites using ChatGPT towards human-machine collaborative molecular design

Ito, Muraoka, Nakayama

2024

Preprint


Designing organic molecules lies at the heart of solving numerous chemistry-related challenges, necessitating effective collaboration between human intuition and computational power. This study demonstrates how general-purpose Large Language Models (LLMs) such as GPT-4 can facilitate the design of potent molecules, leveraging feedback from experiments and empirical knowledge through natural language. We used this approach to design organic structure-directing agents (OSDAs) that guide the crystallization of zeolites. A computational workflow was developed, wherein the LLM proposed novel OSDAs to stabilize targeted zeolites. The suggested candidates underwent evaluation through empirical screening criteria and atomistic simulation. Feedback was then provided to the LLM in natural language to refine subsequent proposals, thus progressively enhancing the proposed OSDAs and promoting the exploration of chemical space. The predicted candidates encompassed experimentally validated OSDAs, structurally analogous ones, and novel ones with superior affinity scores, underscoring the robust capability of the LLM. The collaborations between humans and machines, utilizing natural language as the communication interface, hold potential for application in other molecular design tasks, including drug design.


“…It aimed at deriving models for predicting potent compounds for targets of interest without specifying numerical potency values across wide ranges, thereby circumventing some of the obstacles associated with benchmark compound potency predictions [11]. Previously, we derived transformer-based chemical language models (CLMs) for molecular string-to-string conversion conditioned on potency differences between pairs of structural analogues [14, 15]. So-called conditional transformer models not only learn conditional probabilities for character sequence translation, but also for other context-dependent rules (such as molecular property constraints).…”

Section: Introduction (mentioning)

confidence: 99%

“…So-called conditional transformer models not only learn conditional probabilities for character sequence translation, but also for other context-dependent rules (such as molecular property constraints). Our rules included potency difference thresholds required for the formation of activity cliffs (i.e., analogue pairs having largest potency differences in compound activity classes) [14] or, in a generalized form, desired potency difference thresholds for structural analogues [15]. In the latter case, transformer models were trained based on large numbers of analogue pairs with greatly varying potency differences.…”

Section: Introduction (mentioning)

confidence: 99%

“…In the latter case, transformer models were trained based on large numbers of analogue pairs with greatly varying potency differences. In both instances, conditional transformers consistently reproduced highly potent compounds from activity cliffs or other compound pairs for a variety of activity classes, thus providing proof-of-principle, and generated other structurally diverse candidate compounds [14, 15]. On the basis of these findings, we extended this transformer architecture for generative modeling of potent compounds by a meta-learning framework for modeling in low compound data regimes [16].…”

Section: Introduction (mentioning)

confidence: 99%


Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model

Chen, Bajorath

2024

J Cheminform

Self Cite


Deep learning models adapted from natural language processing offer new opportunities for the prediction of active compounds via machine translation of sequential molecular data representations. For example, chemical language models are often derived for compound string transformation. Moreover, given the principal versatility of language models for translating different types of textual representations, off-the-beaten-path design tasks might be explored. In this work, we have investigated generative design of active compounds with desired potency from target sequence embeddings, representing a rather provoking prediction task. Therefore, a dual-component conditional language model was designed for learning from multimodal data. It comprised a protein language model component for generating target sequence embeddings and a conditional transformer for predicting new active compounds with desired potency. To this end, the designated “biochemical” language model was trained to learn mappings of combined protein sequence and compound potency value embeddings to corresponding compounds, fine-tuned on individual activity classes not encountered during model derivation, and evaluated on compound test sets that were structurally distinct from training sets. The biochemical language model correctly reproduced known compounds with different potency for all activity classes, providing proof-of-concept for the approach. Furthermore, the conditional model consistently reproduced larger numbers of known compounds as well as more potent compounds than an unconditional model, revealing a substantial effect of potency conditioning. The biochemical language model also generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Overall, generative compound design based on potency value-conditioned target sequence embeddings yielded promising results, rendering the approach attractive for further exploration and practical applications. 
Scientific contribution: The approach introduced herein combines protein language model and chemical language model components, representing an advanced architecture, and is the first methodology for predicting compounds with desired potency from conditioned protein sequence data.


“…Furthermore, ML-based and randomized potency value predictions are often only separated by narrow error margins of 1 to 2 orders of magnitude, which leads to artificially favorable predictions in benchmark settings. At least in part, these limitations result from compound potency and similarity distributions in target-based compound sets (often termed activity classes) that are commonly used for benchmarking. As a possible alternative, potency predictions might be focused on identifying highly potent compounds, taking into account that it might be difficult to precisely predict their potency values, given that their magnitude is statistically underrepresented in activity classes.…”

Section: Introduction (mentioning)

confidence: 99%

Anatomy of Potency Predictions Focusing on Structural Analogues with Increasing Potency Differences Including Activity Cliffs

Janela, Bajorath

2023

J. Chem. Inf. Model.

Self Cite


Potency predictions are popular in compound design and optimization but are complicated by intrinsic limitations. Moreover, even for nonlinear methods, activity cliffs (ACs, formed by structural analogues with large potency differences) represent challenging test cases for compound potency predictions. We have devised a new test system for potency predictions, including AC compounds, that is based on partitioned matched molecular pairs (MMP) and makes it possible to monitor prediction accuracy at the level of analogue pairs with increasing potency differences. The results of systematic predictions using different machine learning and control methods on MMP-based data sets revealed increasing prediction errors when potency differences between corresponding training and test compounds increased, including large prediction errors for AC compounds. At the global level, these prediction errors were not apparent due to the statistical dominance of analogue pairs with small potency differences. Test compounds from such pairs were accurately predicted and determined the observed global prediction accuracy. Shapley value analysis, an explainable artificial intelligence approach, was applied to identify structural features determining potency predictions using different methods. The analysis revealed that numerical predictions of different regression models were determined by features that were shared by MMP partner compounds or absent in these compounds, with opposing effects. These findings provided another rationale for accurate predictions of similar potency values for structural analogues and failures in predicting the potency of AC compounds.

