2019
DOI: 10.26434/chemrxiv.7886165.v1
Preprint

Hidden Bias in the DUD-E Dataset Leads to Misleading Performance of Deep Learning in Structure-Based Virtual Screening

Abstract: Recently, much effort has been invested in using convolutional neural network (CNN) models trained on 3D structural images of protein-ligand complexes to distinguish binding from non-binding ligands for virtual screening. However, the dearth of reliable protein-ligand X-ray structures and binding affinity data has required the use of constructed datasets for the training and evaluation of CNN molecular recognition models. Here, we outline various sources of bias in one such widely used dataset, the Dir…

Cited by 9 publications (12 citation statements)
References 44 publications

Citation statements
“…As we showed, this improves the performance of the trained models, which suggests that the newly predicted, formerly missing, labels are accurate. The common assumption of independent and identically distributed data has previously been recognised as problematic [21][22][23][25][26][27]. So, to show improvements in addressing missing labels, here we also developed a new method of model evaluation called block bootstrapping that is more computationally intensive but explicitly avoids biased evaluation by removing structural analogues from each and every test/train split.…”
Section: Discussion (mentioning)
confidence: 99%
“…To remedy this, we sought an evaluation method that is robust to possible memorization bias, also known as test-train leakage. Recent LBVS literature has rightly pointed out that randomly selecting ligands from the training set to create a test set for evaluation can result in overly optimistic performance estimates that do not align with prospective validation [21][22][23]. This occurs because the ligands in most bioactivity datasets are not independent and identically distributed: the discovery of one active ligand often leads to many highly similar structural analogues with only a few changed atoms [24], suggesting that the number of independent data points in most LBVS datasets is far fewer than the actual number of ligands.…”
Section: Introduction (mentioning)
confidence: 99%
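
The leakage described in the statement above can be illustrated with a minimal sketch of a group-aware split, in which whole groups of analogues are placed on one side of the split. This is not the block bootstrapping method of the citing authors; the function name `grouped_split`, the ligand identifiers, and the group labels are hypothetical, and the grouping (for example by scaffold or cluster) is assumed to have been computed elsewhere.

```python
import random
from collections import defaultdict


def grouped_split(ligand_groups, test_fraction=0.25, seed=0):
    """Split ligand ids into train/test so no analogue group spans both sets.

    ligand_groups: dict mapping ligand id -> group id (hypothetical labels,
    e.g. a scaffold or cluster assignment computed elsewhere).
    """
    # Collect the ligands belonging to each analogue group.
    by_group = defaultdict(list)
    for ligand, group in ligand_groups.items():
        by_group[group].append(ligand)

    # Shuffle the groups (not the ligands) reproducibly.
    groups = sorted(by_group)
    random.Random(seed).shuffle(groups)

    target = test_fraction * len(ligand_groups)
    train, test = [], []
    for group in groups:
        # Whole groups are assigned together, so analogues never straddle
        # the split. The test set is filled until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend(by_group[group])
    return train, test


# Toy data (assumed): eight ligands in four analogue groups.
ligand_groups = {
    "lig1": "A", "lig2": "A", "lig3": "A",
    "lig4": "B", "lig5": "B",
    "lig6": "C", "lig7": "D", "lig8": "D",
}
train, test = grouped_split(ligand_groups)
```

Because groups are kept whole, the test set may be somewhat larger than the requested fraction; a random per-ligand split would hit the fraction exactly but could place near-identical analogues on both sides.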
“…D‐COID is another attempt at building a training dataset with the aim to generate highly compelling decoy complexes that are individually matched to active complexes [95] given that challenging decoys or negatives are not commonly used. An earlier well‐known dataset is the DUD‐E decoy compilation [61] that may include hidden bias [96]. vScreenML was trained as a general‐purpose classifier for virtual screening built on the XGBoost framework.…”
Section: Applications of ML in Drug Design (mentioning)
confidence: 99%
“…A related problem is becoming apparent in the number of studies using ML but reusing datasets that may be incomplete or biased and thus increase the difficulty to compare and assess studies or approaches [23,96]. For example, robust, simpler techniques appear to be more transparently published and reproducible in QSAR publications [204].…”
Section: Challenges and Outlook for ML for NTDs (mentioning)
confidence: 99%
“…[21,22] Among the challenges noted are the large chemical diversity in ligands, numerous classes of proteins with varying structural features, as well as biases in available testing sets [23,24]. When looking at the applicability domain of existing docking programs, some limitations are apparent. For example, covalent drugs have featured in recent reports, demonstrating their potential.…”
Section: Introduction (mentioning)
confidence: 99%