2019
DOI: 10.26434/chemrxiv.7886165
Preprint

Hidden Bias in the DUD-E Dataset Leads to Misleading Performance of Deep Learning in Structure-Based Virtual Screening

Abstract: Recently much effort has been invested in using convolutional neural network (CNN) models trained on 3D structural images of protein-ligand complexes to distinguish binding from non-binding ligands for virtual screening. However, the dearth of reliable protein-ligand X-ray structures and binding affinity data has required the use of constructed datasets for the training and evaluation of CNN molecular recognition models. Here, we outline various sources of bias in one such widely-used dataset, the Dir…

Cited by 62 publications (97 citation statements)
“…We believe that this substantial reduction in bias will benefit the development and improve generalisation of structure-based virtual screening methods. Currently, methods can perform well on retrospective benchmarks without performing molecular recognition by simply learning underlying biases (3,22,27). Thus it is unclear if improvements are genuine or due to more closely capturing these biases.…”
Section: Discussion (mentioning)
confidence: 99%
“…The reported results show that these methods substantially outperform other methodologies such as empirical and knowledge-based scoring functions at SBVS. Concerningly, some reports have suggested that a key driver of the performance of machine learning-based systems is hidden biases in the training data, such as physicochemical differences, and that these methods are not learning to perform molecular recognition (3,22). Better decoy molecules are essential to remove the biases in datasets that are hindering the development of virtual screening methods.…”
Section: Introduction (mentioning)
confidence: 99%
“…Also, 3-D structural information (especially the target-ligand complexes) is only available for a small portion of the DTI space; as a result, their coverage is comparably low and they generally are not suitable for large-scale DTI prediction. It is also important to note that the DUD-E benchmark dataset is reported to suffer from negative selection bias problem, 43 and thus, the results based on this dataset may not be conclusive.…”
Section: Large-scale Performance Evaluation and Comparison (mentioning)
confidence: 97%
“…Another risk is the negative selection bias, where negative samples (i.e., inactive or non-binder compounds) in the training and/or test datasets are structurally similar to each other in a way, which is completely unrelated to their binding related properties. 43 So, a machine learning classifier can easily exploit this feature to successfully separate them from the positives. Both of these cases would result in an overestimation of the model performance during benchmarks, especially when the tests are made to infer to performance of the models in predicting completely novel binders to the modelled target proteins.…”
Section: Sources Of Dataset Bias In Model Evaluation (mentioning)
confidence: 99%