Improved Lipophilicity and Aqueous Solubility Prediction with Composite Graph Neural Networks

Wieder, Oliver; Kuenemann, Mélaine A.; Wieder, Marcus; Seidel, Thomas; Meyer, Christophe; Bryant, Sharon D.; Langer, Thierry

doi:10.3390/molecules26206185

Cited by 23 publications

(20 citation statements)

References 15 publications


“…Numerous computational approaches for the prediction of log P have been developed. We provide a brief overview here and point toward more detailed descriptions elsewhere. Computational approaches to log P prediction can be grouped into (i) empirical and (ii) physics-based methods. Empirical methods (i) include contribution-type approaches (atom- or fragment-based), QSAR approaches, and deep learning approaches trained on experimental data. Contribution-type approaches obtain a log P estimate by dividing molecules into either individual atoms or fragments and summing up their contributions, using correction terms in the latter case.…”

Section: Introduction (mentioning)

confidence: 99%
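The contribution-type idea quoted above (split a molecule into fragments, sum their contributions, add correction terms) can be sketched as follows. This is a minimal illustration only: the fragment names, contribution values, and correction term are hypothetical placeholders, not parameters from any published method.

```python
from collections import Counter

# Hypothetical values for illustration only; real contribution methods
# fit these numbers to experimental log P data.
FRAGMENT_CONTRIBUTIONS = {"CH3": 0.5, "CH2": 0.4, "OH": -1.2}
CORRECTION_TERMS = {"hydroxyl_on_short_chain": 0.1}


def estimate_logp(fragments, corrections=()):
    """Sum per-fragment contributions, then add any correction terms."""
    counts = Counter(fragments)
    total = sum(FRAGMENT_CONTRIBUTIONS[name] * n for name, n in counts.items())
    total += sum(CORRECTION_TERMS[name] for name in corrections)
    return total


# Ethanol written as an assumed fragment list (CH3-CH2-OH).
ethanol_logp = estimate_logp(
    ["CH3", "CH2", "OH"], corrections=["hydroxyl_on_short_chain"]
)
```

An atom-based variant would use the same summation over atom types and simply omit the correction terms.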

Machine Learning for Fast, Quantum Mechanics-Based Approximation of Drug Lipophilicity

et al. 2023


Lipophilicity, as measured by the partition coefficient between octanol and water (log P), is a key parameter in early drug discovery research. However, measuring log P experimentally is difficult for specific compounds and log P ranges. The resulting lack of reliable experimental data impedes development of accurate in silico models for such compounds. In certain discovery projects at Novartis focused on such compounds, a quantum mechanics (QM)-based tool for log P estimation has emerged as a valuable supplement to experimental measurements and as a preferred alternative to existing empirical models. However, this QM-based approach incurs a substantial computational cost, limiting its applicability to small series and prohibiting quick, interactive ideation. This work explores a set of machine learning models (Random Forest, Lasso, XGBoost, Chemprop, and Chemprop3D) to learn calculated log P values on both a public data set and an in-house data set to obtain a computationally affordable, QM-based estimation of drug lipophilicity. The message-passing neural network model Chemprop emerged as the best performing model with mean absolute errors of 0.44 and 0.34 log units for scaffold split test sets of the public and in-house data sets, respectively. Analysis of learning curves suggests that a further decrease in the test set error can be achieved by increasing the training set size. While models directly trained on experimental data perform better at approximating experimentally determined log P values than models trained on calculated values, we discuss the potential advantages of using calculated log P values going beyond the limits of experimental quantitation. We analyze the impact of the data set splitting strategy and gain insights into model failure modes. Potential use cases for the presented models include pre-screening of large compound collections and prioritization of compounds for full QM calculations.
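The abstract above reports mean absolute errors on scaffold split test sets. As a rough sketch of what those two terms mean, the code below groups molecules by a precomputed scaffold key so that no scaffold is shared between training and test sets, and computes a mean absolute error. The greedy assignment rule, the function names, and the scaffold strings are assumptions for illustration; the paper's actual splitting procedure may differ.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split sample indices so that no scaffold appears in both sets.

    `scaffolds` holds one scaffold key per molecule (assumed precomputed,
    for example a scaffold SMILES string).
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first; ties broken by scaffold key for determinism.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

    n_train_target = len(scaffolds) - int(round(test_fraction * len(scaffolds)))
    train, test = [], []
    for _, members in ordered:
        # Whole groups go to train while they fit, otherwise to test.
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


example_scaffolds = [
    "benzene", "benzene", "benzene", "pyridine", "pyridine",
    "indole", "furan", "furan", "thiophene", "imidazole",
]
train_idx, test_idx = scaffold_split(example_scaffolds, test_fraction=0.2)
```

Because test molecules carry scaffolds never seen in training, errors measured this way are usually a stricter estimate of generalization than errors from a random split.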



“…As a result of regression, 18 out of 29 test sets showed R² > 0.80 (as can be seen in Figure S14). This suggests that solubility between unknown compounds can be still predicted using one of our models, as opposed to approaches in previous studies where individual models were trained with sufficient data for solvents.…”

Section: Results and Discussion (mentioning)

confidence: 99%

“…For example, DGraphDTA is a multi-input network for drug–target affinity prediction, and the combination of a GCN and graph attention networks (GATs) further improves the model. Recently, the prediction of aqueous solubility using a graph-based message passing network, directed edge graph isomorphism network, and multilevel GCN has been reported. However, in the previous studies mentioned above, it is common to predict the solubility using an individual data set and a model for one solvent in the previous studies, so there is a limitation because sufficient data is required. Therefore, in this study, a methodology for predicting the solubility between solute–solvents of various combinations was proposed.…”

Section: Introduction (mentioning)

confidence: 99%

Novel Solubility Prediction Models: Molecular Fingerprints and Physicochemical Features vs Graph Convolutional Neural Networks

Lee, Gyak et al. 2022

ACS Omega


Predicting both accurate and reliable solubility values has long been a crucial but challenging task. In this work, surrogated model-based methods were developed to accurately predict the solubility of two molecules (solute and solvent) through machine learning and deep learning. The current study employed two methods: (1) converting molecules into molecular fingerprints and adding optimal physicochemical properties as descriptors and (2) using graph convolutional network (GCN) models to convert molecules into a graph representation and deal with prediction tasks. Then, two prediction tasks were conducted with each method: (1) the solubility value (regression) and (2) the solubility class (classification). The fingerprint-based method clearly demonstrates that high performance is possible by adding simple but significant physicochemical descriptors to molecular fingerprints, while the GCN method shows that it is possible to predict various properties of chemical compounds with relatively simplified features from the graph representation. The developed methodologies provide a comprehensive understanding of constructing a proper model for predicting solubility and can be employed to find suitable solutes and solvents.


“…Recently, graph-based deep learning (DL) methods have attracted a multitude of attention and manifested remarkable effects on drug discovery ranging from molecular property prediction to drug virtual screening. These methods are capable of learning suitable molecular representations directly from chemical graphs in an end-to-end fashion. Related works in lipophilicity and solubility prediction have verified the superiority of graph representation learning models, including undirected graph recursive neural networks (UGRNN) and a variety of graph neural networks (GNN).…”

Section: Introduction (mentioning)

confidence: 99%
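To make "learning representations directly from chemical graphs" concrete, here is a toy sketch of one message-passing round followed by a sum readout. It uses plain sum aggregation with no learned weights or nonlinearities, which real GNNs add on top, and the two-element atom features are an assumption chosen for illustration.

```python
def message_passing_step(features, neighbors):
    """One round of sum aggregation: each atom adds its neighbors' features to its own."""
    updated = {}
    for atom, own in features.items():
        aggregated = list(own)
        for other in neighbors[atom]:
            aggregated = [a + b for a, b in zip(aggregated, features[other])]
        updated[atom] = aggregated
    return updated


def readout(features):
    """Sum all atom vectors into a single molecule-level vector."""
    return [sum(column) for column in zip(*features.values())]


# Ethanol heavy atoms as a graph: C0 - C1 - O2.
# Assumed atom feature layout: [is_carbon, is_oxygen].
features = {0: [1, 0], 1: [1, 0], 2: [0, 1]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

after_one_step = message_passing_step(features, neighbors)
molecule_vector = readout(after_one_step)
```

After one round, each atom's vector already encodes its immediate chemical environment; stacking further rounds widens that neighborhood by one bond each time.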

“…In this context, we can capitalize on prior domain knowledge to boost model performance. Since log D and log Sw are highly correlated to log P, some studies have successfully attained better prediction via multitask learning. In 2020, Lukashina et al. introduced a substructure encoder representing functional groups as hyperatoms to provide complementary information for the directed message passing neural network (D-MPNN). The proposed StructGNN model outperformed D-MPNN on log P and log D predictions.…”

Section: Introduction (mentioning)

confidence: 99%

ALipSol: An Attention-Driven Mixture-of-Experts Model for Lipophilicity and Solubility Prediction

Wang et al. 2022

J. Chem. Inf. Model.


Lipophilicity (logD) and aqueous solubility (logSw) play a central role in drug development. The accurate prediction of these properties remains to be solved due to data scarcity. Current methodologies neglect the intrinsic relationships between physicochemical properties and usually ignore the ionization effects. Here, we propose an attention-driven mixture-of-experts (MoE) model named ALipSol, which explicitly reproduces the hierarchy of task relationships. We adopt the principle of divide-and-conquer by breaking down the complex end point (logD or logSw) into simpler ones (acidic pKa, basic pKa, and logP) and allocating a specific expert network for each subproblem. Subsequently, we implement transfer learning to extract knowledge from related tasks, thus alleviating the dilemma of limited data. Additionally, we substitute the gating network with an attention mechanism to better capture the dynamic task relationships on a per-example basis. We adopt local fine-tuning and consensus prediction to further boost model performance. Extensive evaluation experiments verify the success of the ALipSol model, which achieves RMSE improvement of 8.04%, 2.49%, 8.57%, 12.8%, and 8.60% on the Lipop, ESOL, AqSolDB, external logD, and external logS data sets, respectively, compared with Attentive FP and the state-of-the-art in silico tools. In particular, our model yields more significant advantages (Welch’s t-test) for small training data, implying its high robustness and generalizability. The interpretability analysis proves that the atom contributions learned by ALipSol are more reasonable compared with the vanilla Attentive FP, and the substitution effects in benzene derivatives agreed well with empirical constants, revealing the potential of our model to extract useful patterns from data and provide guidance for lead optimization.
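The decomposition of logD into pKa and logP mentioned in the abstract rests on a well-known textbook relationship, sketched below for monoprotic compounds. This is not the ALipSol architecture: it is the classical formula under the assumption that only the neutral species partitions into octanol, and the function names and the default pH of 7.4 are illustrative choices.

```python
import math


def logd_monoprotic_acid(logp, pka, ph=7.4):
    """logD of a monoprotic acid, assuming only the neutral form partitions."""
    return logp - math.log10(1.0 + 10.0 ** (ph - pka))


def logd_monoprotic_base(logp, pka, ph=7.4):
    """logD of a monoprotic base, assuming only the neutral form partitions."""
    return logp - math.log10(1.0 + 10.0 ** (pka - ph))


# Hypothetical acid: logP 3.0, pKa 4.4. At pH 7.4 it is mostly ionized,
# so its logD falls roughly three log units below its logP.
acid_logd = logd_monoprotic_acid(logp=3.0, pka=4.4, ph=7.4)
```

Under this relationship logD never exceeds logP, and the gap grows by about one log unit for every pH unit the compound moves further into its ionized range.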

