2020
DOI: 10.1021/acs.jcim.0c00155

LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening

Abstract: Comparative evaluation of virtual screening methods requires a rigorous benchmarking procedure on diverse, realistic, and unbiased data sets. Recent investigations from numerous research groups unambiguously demonstrate that artificially constructed ligand sets classically used by the community (e.g., DUD, DUD-E, MUV) are unfortunately biased by both obvious and hidden chemical biases, therefore overestimating the true accuracy of virtual screening methods. We herewith present a novel data set (LIT-PCBA) speci…

Citations: cited by 150 publications (208 citation statements)
References: 47 publications
“…A number of data collections of different sizes have been designed from PubChem data and introduced to the scientific community, offering better references for evaluating novel virtual screening methods. Not counting nonpublicly available data sets (e.g., the three small- and medium-sized ligand sets that we designed in 2019 to validate our new pharmacophore-based ligand-aligning procedure [ 79 ]), in this section, we only mention open-access ones, including the Maximum Unbiased Validation (MUV) data sets [ 80 ], the UCI Machine-Learning Repository [ 81 ], the BCL::ChemInfo framework by Butkiewicz et al [ 82 ], the Lindh et al data collection [ 83 ], and our recently introduced LIT-PCBA ( Table 1 ) [ 22 ].…”
Section: What We Can Do With Pubchem Bioassay Data: From the Data (citation type: mentioning)
confidence: 99%
“…Five years later, we (Tran-Nguyen et al) developed and introduced a novel data collection entitled LIT-PCBA [ 22 ]. A rigorous systematic search was first performed on the ensemble of PubChem bioactivity assays, keeping only confirmatory screens conducted with over 10,000 substances, giving no fewer than 50 active molecules, against a single protein target having at least one crystal Protein Data Bank (PDB) structure bound to a drug-like ligand of the same phenotype as that of the confirmed actives.…”
Section: What We Can Do With Pubchem Bioassay Data: From the Data (citation type: mentioning)
confidence: 99%
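The assay-selection criteria quoted in the statement above can be read as a simple filter. The following is only a minimal sketch under stated assumptions: the `Assay` record, its field names, and the example values are hypothetical and are not taken from PubChem or from the LIT-PCBA authors' code. Only the thresholds (confirmatory screen, more than 10,000 substances, at least 50 actives, a single protein target, at least one suitable PDB structure) come from the quoted text.

```python
from dataclasses import dataclass


@dataclass
class Assay:
    # Hypothetical record; field names are illustrative assumptions.
    aid: int                     # assay identifier (made-up values below)
    is_confirmatory: bool        # confirmatory screen, not a primary one
    n_substances: int            # substances tested in the assay
    n_actives: int               # confirmed active molecules
    n_protein_targets: int       # number of protein targets of the assay
    n_matching_pdb: int          # PDB structures bound to a drug-like ligand
                                 # of the same phenotype as the actives


def passes_filters(a: Assay) -> bool:
    """Apply the criteria listed in the quoted citation statement."""
    return (
        a.is_confirmatory
        and a.n_substances > 10_000      # "over 10,000 substances"
        and a.n_actives >= 50            # "no fewer than 50 active molecules"
        and a.n_protein_targets == 1     # "a single protein target"
        and a.n_matching_pdb >= 1        # "at least one crystal ... structure"
    )


# Invented example data, for illustration only.
assays = [
    Assay(1, True, 250_000, 120, 1, 3),
    Assay(2, True, 8_000, 200, 1, 2),     # too few substances
    Assay(3, True, 300_000, 35, 1, 1),    # too few actives
    Assay(4, False, 300_000, 500, 1, 1),  # not confirmatory
    Assay(5, True, 300_000, 500, 2, 1),   # more than one target
    Assay(6, True, 300_000, 500, 1, 0),   # no suitable PDB structure
]

kept = [a.aid for a in assays if passes_filters(a)]
print(kept)
```

In this sketch each criterion is a separate boolean term, so a rejected assay can be traced back to the specific condition it failed.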
“…There are efforts to construct sets using only known inactives (e.g. 19,24); however, these are relatively small in size and are not yet suitable for training modern machine learning methods.…”
Section: Introduction (citation type: mentioning)
confidence: 99%