2020
DOI: 10.1101/2020.03.06.979625
Preprint

Dataset Augmentation Allows Deep Learning-Based Virtual Screening To Better Generalize To Unseen Target Classes, And Highlight Important Binding Interactions

Abstract: Current deep learning methods for structure-based virtual screening take the structures of both the protein and the ligand as input but make little or no use of the protein structure when predicting ligand binding. Here we show how a relatively simple method of dataset augmentation forces such deep learning methods to take into account information from the protein. Models trained in this way are more generalisable (make better predictions on protein-ligand complexes from a different distribution to the train…

Cited by 2 publications (3 citation statements)
References 28 publications
“…Other techniques include data augmentation, basically increasing the data available by transforming and generating features (descriptors) based on the data already available. Using protein information to augment data has been shown to be able to improve DL postprocessing of virtual screening results for generalization (make better predictions on protein/ligand complexes from a different distribution to the training data) by forcing to include protein/ligand information into the model 98 . A problem for virtual screening is a large proportion of false positives.…”
Section: Applications Of ML In Drug Design
Citation type: mentioning
confidence: 99%
“…Using protein information to augment data has been shown to be able to improve DL postprocessing of virtual screening results for generalization (make better predictions on protein/ligand complexes from a different distribution to the training data) by forcing to include protein/ligand information into the model. 98 A problem for virtual screening is a large proportion of false positives. Including more stringent decoys matched by molecular properties and binding conformations in an XGBoost procedure (gradient-boosted decision trees) showed notable separation of the scores assigned to active/decoy complexes along with a slight increase in MCC of 0.57.…”
Section: Applications Of ML In Drug Design
Citation type: mentioning
confidence: 99%
“…Therefore, multiple attempts have been made to approximate the binding free energy by minimizing or totally excluding the sampling step. This resulted in a considerable number of scoring functions, 131 which, in general, aim to approximate the free energy change upon binding. The binding Gibbs free energy can be written as 32,33 where the P superscript refers to the interactions with the protein, L to those with the ligand, ⟨U⟩ and ⟨W⟩ are the averaged potential and solvation energies, respectively, and ΔS_config is the entropy change related to protein and ligand motions upon complex formation.…”
Section: Introduction
Citation type: mentioning
confidence: 99%