Multitask Machine Learning for Classifying Highly and Weakly Potent Kinase Inhibitors

Rodríguez-Pérez, Raquel; Bajorath, Jürgen

doi:10.1021/acsomega.9b00298

Cited by 62 publications

(53 citation statements)

References 32 publications

“…Rather than searching for similar molecules, machine learning models are trained to predict the activities of molecules based on their fingerprints. [8][9][10][11] This bypasses the need for similarity search but these approaches still rely, at its core, on precalculated fingerprints. A new class of ML algorithms, called Graph Neural Networks (GNN) are thought to overcome the calculation of fingerprints.…”

Section: Introduction (mentioning)

confidence: 99%

Using Domain-Specific Fingerprints Generated Through Neural Networks to Enhance Ligand-Based Virtual Screening

Menke

Koch

2021

J. Chem. Inf. Model.

Molecular fingerprints are essential for different cheminformatics approaches like similarity-based virtual screening. In this work, the concept of neural (network) fingerprints in the context of similarity search is introduced in which the activation of the last hidden layer of a trained neural network represents the molecular fingerprint. The neural fingerprint performance of five different neural network architectures was analyzed and compared to the well-established Extended Connectivity Fingerprint (ECFP) and an autoencoder-based fingerprint. This is done using a published compound dataset with known bioactivity on 160 different kinase targets. We expect neural networks to combine information about the molecular space of already known bioactive compounds together with the information on the molecular structure of the query and by doing so enrich the fingerprint. The results show that indeed neural fingerprints can greatly improve the performance of similarity searches. Most importantly, it could be shown that the neural fingerprint performs well even for kinase targets that were not included in the training. Surprisingly, while Graph Neural Networks (GNNs) are thought to offer an advantageous alternative, the best performing neural fingerprints were based on traditional fully connected layers using the ECFP4 as input. The best performing kinase-specific neural fingerprint will be provided for public use.
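The idea in this abstract, taking the activations of the last hidden layer as the fingerprint and using it for similarity search, can be illustrated with a minimal sketch. The weights in `HIDDEN_LAYERS` are hypothetical toy values standing in for a trained network, the 4-bit inputs stand in for ECFP4 bit vectors, and cosine similarity is an assumed metric; the abstract does not state which similarity measure the authors used.

```python
import math

def dense(x, weights, bias):
    # One fully connected layer; each row of `weights` feeds one output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, x) for x in v]

# Hypothetical toy weights standing in for the hidden layers of a trained network.
HIDDEN_LAYERS = [
    ([[1.0, 0.0, 1.0, 0.0],
      [0.0, 1.0, 0.0, 1.0],
      [1.0, 1.0, 0.0, 0.0]], [0.0, 0.0, -1.0]),
]

def neural_fingerprint(bits, layers=HIDDEN_LAYERS):
    # Propagate through the hidden layers only; the output layer is skipped,
    # so the returned vector is the last hidden-layer activation.
    h = list(bits)
    for weights, bias in layers:
        h = relu(dense(h, weights, bias))
    return h

def cosine(a, b):
    # Assumed similarity metric (not specified in the source).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def rank_by_similarity(query_bits, library):
    # Rank library compounds by neural-fingerprint similarity to the query.
    q = neural_fingerprint(query_bits)
    scored = [(name, cosine(q, neural_fingerprint(bits)))
              for name, bits in library.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

Calling `rank_by_similarity(query_bits, library)` with a dictionary of named bit vectors returns the compounds sorted from most to least similar in the learned representation.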

“…As a methodologically distinct application, MT-DNNs were trained for predicting highly and weakly potent inhibitors of different kinases and predictions were interpreted. The feasibility of such predictions was demonstrated previously [41]. The architectures of MT-DNN models contained multiple output neurons, each of which represented a different prediction task (target).…”

Section: Multi-target Activity Prediction (mentioning)

confidence: 90%

Interpretation of machine learning models using shapley values: application to compound potency and multi-target activity predictions

Rodríguez-Pérez

Bajorath

2020

J Comput Aided Mol Des

Self Cite

Difficulties in interpreting machine learning (ML) models and their predictions limit the practical applicability of and confidence in ML in pharmaceutical research. There is a need for agnostic approaches aiding in the interpretation of ML models regardless of their complexity that is also applicable to deep neural network (DNN) architectures and model ensembles. To these ends, the SHapley Additive exPlanations (SHAP) methodology has recently been introduced. The SHAP approach enables the identification and prioritization of features that determine compound classification and activity prediction using any ML model. Herein, we further extend the evaluation of the SHAP methodology by investigating a variant for exact calculation of Shapley values for decision tree methods and systematically compare this variant in compound activity and potency value predictions with the model-independent SHAP method. Moreover, new applications of the SHAP analysis approach are presented including interpretation of DNN models for the generation of multi-target activity profiles and ensemble regression models for potency prediction.
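To make the notion of an exact Shapley value concrete, the following is a minimal brute-force sketch that enumerates all feature coalitions. It is not the tree-specific variant evaluated in the paper, and it is only practical for a handful of features. Replacing absent features with a `baseline` vector is an assumption made here for illustration; the function and parameter names are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating every coalition of features.

    Features outside a coalition take their value from `baseline`
    (an assumed way of representing an 'absent' feature).
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi
```

A useful sanity check is the additivity property: the returned values sum to `model(x) - model(baseline)`.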

“…Clustering-based validation strategies have been used to avoid the compound series bias, making sure that there are no similar molecules both in training, validation and test sets. 18,26,27 We followed the implementation of our previous study on cross-validation strategies in PCM, 8 where K-means clustering with k = 100 was applied to the fingerprint description of the compounds. Data was divided in training, validation and test sets with a proportion of 80/10/10%.…”

Section: Validation Strategy (mentioning)

confidence: 99%

Balancing Data on Proteochemometrics Activity Classification

Rio

Picart

Perera-Lluna

2021

Preprint

In silico analysis of biological activity data has become an essential technique in pharmaceutical development. Specifically, the so-called proteochemometric models aim to share information between targets in machine learning ligand-target activity prediction models. However, bioactivity datasets used in proteochemometrics modeling are usually imbalanced, which could potentially affect the performance of the models. In this work, we explored the effect of different balancing strategies in deep learning proteochemometric target-compound activity classification models while controlling for the compound series bias through clustering. These strategies were: (1) no_resampling, (2) resampling_after_clustering, (3) resampling_before_clustering and (4) semi_resampling. These schemas were evaluated in kinases and GPCRs from BindingDB. We observed that the predicted proportion of positives was driven by the actual data balance in the test set. Additionally, it was confirmed that data balance had an impact on the performance estimates of the proteochemometrics model. We recommend a combination of data augmentation and clustering in the training set (semi_resampling) in order to mitigate the data imbalance effect in a realistic scenario. The code of this analysis is publicly available at https://github.com/b2slab/imbalance_pcm_benchmark.
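The cluster-based 80/10/10 split described in the citation statement for this entry can be sketched as follows. The sketch assumes the cluster labels already exist (for example from K-means on the fingerprints) and assigns whole clusters to a split so that no cluster is shared between training, validation and test sets. The greedy allocation rule and the name `cluster_split` are assumptions made for illustration; the source does not say how clusters were distributed across the three sets.

```python
import random
from collections import defaultdict

def cluster_split(cluster_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/validation/test splits.

    cluster_ids[i] is the (precomputed) cluster label of compound i.
    Returns three lists of compound indices.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(cluster_ids):
        groups[label].append(idx)

    clusters = list(groups.values())
    random.Random(seed).shuffle(clusters)   # randomize ties reproducibly
    clusters.sort(key=len, reverse=True)    # place the largest clusters first

    total = len(cluster_ids)
    splits = [[], [], []]
    for members in clusters:
        # Greedy rule (an assumption): give the cluster to the split that is
        # currently furthest below its target size.
        deficits = [fractions[s] * total - len(splits[s]) for s in range(3)]
        splits[deficits.index(max(deficits))].extend(members)
    return splits
```

With very uneven cluster sizes the realized proportions only approximate the targets, since clusters are never divided.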
