Predict Ionization Energy of Molecules Using Conventional and Graph-Based Machine Learning Models

Liu, Yufeng; Li, Zhenyu

doi:10.1021/acs.jcim.2c01321

Cited by 9 publications

(9 citation statements)

References 45 publications


“…In fact, several new models for compound-kinase binding prediction are introduced every month [CCAS + 15, CRA + 21, DQJ + 22, DSSGP22]. They differ in the learning algorithm used, such as simple k-nearest neighbor regression [BHS + 21], decision trees [TAA + 22], kernel learning [MM12, NPC16, CRP + 17, CPS + 18] and deep learning methods [BHS + 21, O18, KZEK23, LLP23, SSB + 23], as well as compound and protein descriptors, including compound SMILES and graphs [DTME20], protein amino acid sequences [BHS + 21, KZEK23] and, lately, more complex 3D structure-based features [KZK + 23, PHL + 23, LKN + 23, LTZ + 23] and embeddings from pretrained large language models [SSB + 23]. Most recent methods modeling compound-kinase activities learn from the descriptors of both compounds and kinases, and are referred to as proteochemometric models.…”

Section: Introduction (mentioning)

confidence: 99%

Leveraging multiple data types for improved compound-kinase bioactivity prediction

Theisen, Wang, Ravikumar et al. 2024

Preprint


Machine learning methods offer time- and cost-effective means for identifying novel chemical matter as well as guiding experimental efforts to map enormous compound-kinase interaction spaces. However, considerable challenges for compound-kinase interaction modeling arise from the heterogeneity of available bioactivity readouts, including single-dose compound profiling results, such as percentage inhibition, and multi-dose-response results, such as IC50. Standard activity prediction approaches utilize only dose-response data in the model training, disregarding a substantial portion of available information contained in single-dose measurements. Here, we propose a novel machine learning methodology for compound-kinase activity prediction that leverages both single-dose and dose-response data. Our two-stage model first learns a mapping between single-dose and dose-response bioactivity readouts, and then generates proxy dose-response activity labels for compounds that have only been tested in single-dose assays. The predictions from the first-stage model are then integrated with experimentally measured dose-response activities to model compound-kinase binding based on chemical structures and kinase features. We demonstrate that our two-stage approach yields accurate activity predictions and significantly improves model performance compared to training solely on dose-response labels, particularly in the most practical and challenging scenarios of predicting activities for new compounds and new compound scaffolds. This superior performance is consistent across five evaluated machine learning methods, including traditional models such as random forest and kernel learning, as well as deep learning-based approaches. Using the best performing model, we carried out extensive experimental profiling on a total of 347 selected compound-kinase pairs, achieving a high hit rate of 40% and a negative predictive value of 78%. 
We show that these rates can be improved further by incorporating model uncertainty estimates into the compound selection process. By integrating multiple activity data types, we demonstrate that our approach holds promise for facilitating the development of training activity datasets in a more efficient and cost-effective way.


“…However, such explicit knowledge about ligand pose in the binding pocket may be unnecessary for predicting ligand binding scores. Indeed, several methods have been proposed that predict activity given just the ligand chemical graph representation and the 3D receptor structure or the ligand graph and the receptor amino acid sequence. Additionally, it is possible to train models that rely solely on docked poses, as is done by Liu et al.…”

Section: Introduction (mentioning)

confidence: 99%

“…Indeed, several methods have been proposed that predict activity given just the ligand chemical graph representation and the 3D receptor structure [20] or the ligand graph and the receptor amino acid sequence [21−23]. Additionally, it is possible to train models that rely solely on docked poses, as is done by Liu et al. [24] As mentioned above, deep learning has proven effective when using much more data than the 20K activity data points available in PDBbind or CrossDocked. Thus, we hypothesized that an expanded data set with orders of magnitude more binding data would result in more accurate models for predicting binders to novel proteins.…”

Section: Introduction (mentioning)

confidence: 99%

“…Other data sets exist that map protein 3D structure (or sequence) to ligand activity without crystal poses, but none have the scope of BigBind. DAVIS, KIBA, and KinCo, for instance, are limited to kinases. Benchmarking data sets such as DUD-E, DEKOIS, and LIT-PCBA also contain activity data without crystal poses, but these are designed to benchmark rather than train SBVS models.…”

Section: Introduction (mentioning)

confidence: 99%


BigBind: Learning from Nonstructural Data for Structure-Based Virtual Screening

Brocidiacono, Francoeur, Aggarwal et al. 2023

J. Chem. Inf. Model.


Deep learning methods that predict protein−ligand binding have recently been used for structure-based virtual screening. Many such models have been trained using protein−ligand complexes with known crystal structures and activities from the PDBbind data set. However, because PDBbind only includes 20K complexes, models typically fail to generalize to new targets, and model performance is on par with models trained with only ligand information. Conversely, the ChEMBL database contains a wealth of chemical activity information but includes no information about binding poses. We introduce BigBind, a data set that maps ChEMBL activity data to proteins from the CrossDocked data set. BigBind comprises 583K ligand activities and includes 3D structures of the protein binding pockets. Additionally, we augmented the data by adding an equal number of putative inactives for each target. Using this data, we developed BANANA (basic neural network for binding affinity), a neural network-based model to classify active from inactive compounds, defined by a 10 μM cutoff. Our model achieved an AUC of 0.72 on BigBind's test set, while a ligand-only model achieved an AUC of 0.59. Furthermore, BANANA achieved competitive performance on the LIT-PCBA benchmark (median EF1% 1.81) while running 16,000 times faster than molecular docking with GNINA. We suggest that BANANA, as well as other models trained on this data set, will significantly improve the outcomes of prospective virtual screening tasks.


Section: Introduction (mentioning)

confidence: 99%

BigBind: Learning from Nonstructural Data for Structure-Based Virtual Screening

Brocidiacono, Francoeur, Aggarwal et al. 2023

Preprint


Deep learning methods that predict protein-ligand binding have recently been used for structure-based virtual screening. Many such models have been trained using protein-ligand complexes with known crystal structures and activities from the PDBbind dataset. However, because PDBbind only includes 20K complexes, models typically fail to generalize to new targets, and model performance is on par with models trained with only ligand information. Conversely, the ChEMBL database contains a wealth of chemical activity information but includes no information about binding poses. We introduce BigBind, a dataset that maps ChEMBL activity data to proteins from the CrossDocked dataset. BigBind comprises 583K ligand activities and includes 3D structures of the protein binding pockets. Additionally, we augmented the data by adding an equal number of putative inactives for each target. Using this data, we developed BANANA (BAsic NeurAl Network for binding Affinity), a neural network-based model to classify active from inactive compounds, defined by a 10 μM cutoff. Our model achieved an AUC of 0.72 on BigBind’s test set, while a ligand-only model achieved an AUC of 0.59. Furthermore, BANANA achieved competitive performance on the LIT-PCBA benchmark (median EF1% 1.81) while running 16,000 times faster than molecular docking with GNINA. We suggest that BANANA, as well as other models trained on this dataset, will significantly improve the outcomes of prospective virtual screening tasks.
