Comparative evaluation of virtual screening methods requires a rigorous benchmarking procedure on diverse, realistic, and unbiased data sets. Recent investigations from numerous research groups unambiguously demonstrate that the artificially constructed ligand sets classically used by the community (e.g., DUD, DUD-E, MUV) suffer from both obvious and hidden chemical biases and therefore overestimate the true accuracy of virtual screening methods. We present here a novel data set (LIT-PCBA) specifically designed for virtual screening and machine learning. LIT-PCBA relies on 149 dose-response PubChem bioassays that were additionally processed to remove false positives and assay artifacts and to keep active and inactive compounds within similar molecular property ranges. To ensure that the data set is suited to both ligand-based and structure-based virtual screening, target sets were restricted to single protein targets for which at least one X-ray structure is available in complex with ligands of the same phenotype (e.g., inhibitor, inverse agonist) as that of the PubChem active compounds. Preliminary virtual screening on the 21 remaining target sets with state-of-the-art orthogonal methods (2D fingerprint similarity, 3D shape similarity, molecular docking) enabled us to select 15 target sets for which at least one of the three screening methods enriches the top 1%-ranked compounds in true actives by at least a factor of 2. The corresponding ligand sets (training, validation) were finally debiased with the recently described asymmetric validation embedding (AVE) procedure to afford the LIT-PCBA data set, consisting of 15 targets and 7844 confirmed active and 407,381 confirmed inactive compounds. The data set mimics experimental screening decks in terms of hit rate (ratio of active to inactive compounds) and potency distribution.
It is available online at http://drugdesign.unistra.fr/LIT-PCBA for download and for benchmarking novel virtual screening methods, notably those relying on machine learning.
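The selection criterion above (enrichment of the top 1%-ranked compounds in true actives by at least a factor of 2) can be illustrated with a minimal sketch of an enrichment factor calculation. It assumes the usual definition (hit rate among the top-ranked fraction divided by the hit rate of the whole set) and that higher scores mean better ranks. The function name `enrichment_factor` and the toy data are hypothetical and are not taken from the paper or its tooling.

```python
import math


def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor in the top-ranked `fraction` of a screened set.

    scores: one score per compound (assumption: higher = ranked better).
    labels: 1 for a confirmed active, 0 for a confirmed inactive.
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equally long")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active is required")
    # Size of the top-ranked subset (at least one compound).
    n_top = max(1, math.ceil(fraction * n_total))
    # Rank compounds by decreasing score and count actives in the top subset.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    # (hits_top / n_top) / (n_actives / n_total), written to limit rounding.
    return (hits_top * n_total) / (n_top * n_actives)


# Toy example (hypothetical data): 1000 compounds, 20 actives,
# one of which is ranked first and 19 of which are ranked last.
toy_scores = [1000 - i for i in range(1000)]
toy_labels = [0] * 1000
toy_labels[0] = 1
for i in range(981, 1000):
    toy_labels[i] = 1

toy_ef = enrichment_factor(toy_scores, toy_labels, fraction=0.01)
print(f"EF1% = {toy_ef:.2f}")
```

Under this definition a value of 1 corresponds to random ranking, so the threshold of 2 quoted above means that the top 1% contains at least twice as many actives as expected by chance.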
Hundreds of fast scoring functions have been developed over the last 20 years to predict binding free energies from three-dimensional structures of protein-ligand complexes. Despite promising statistical performance, we believe that none of them has been properly validated for routine prospective high-throughput virtual screening, mostly because in silico screening challenges usually employ artificially built and biased data sets. Here we carry out a fully unbiased evaluation of four scoring functions (Pafnucy, ΔvinaRF20, IFP, and GRIM) on an in-house developed collection of high-confidence experimental screening data (LIT-PCBA) covering about 3 million data points on 15 diverse pharmaceutical targets. All four scoring functions were applied to rescore the docking poses of LIT-PCBA compounds under conditions that exactly mimic standard drug discovery scenarios and were compared in terms of their propensity to enrich true binders in the top 1%-ranked hit lists. Interestingly, rescoring based on simple interaction fingerprints or interaction graphs outperforms state-of-the-art machine learning and deep learning scoring functions in most cases. The current study notably highlights the strong tendency of deep learning methods to predict affinity values within a very narrow range centered on the mean value of the samples used for training. Moreover, it suggests that knowledge of pre-existing binding modes is the key to detecting the most potent binders.
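The narrow-range behavior described above can be checked with a simple diagnostic, sketched here under stated assumptions: compare the spread of predicted affinities with the spread of the training labels, and measure how far the mean prediction sits from the training mean. This is not the analysis performed in the study. The function name `prediction_spread` and all numbers below are hypothetical illustrations.

```python
from statistics import mean, pstdev


def prediction_spread(predicted, train_labels):
    """Summarize how compressed a set of predictions is.

    std_ratio: spread of predictions relative to the training labels
               (values far below 1 indicate a narrow prediction range).
    mean_offset: mean prediction minus the training-set mean
                 (values near 0 indicate predictions centered on it).
    """
    train_std = pstdev(train_labels)
    if train_std == 0:
        raise ValueError("training labels must not all be identical")
    return {
        "std_ratio": pstdev(predicted) / train_std,
        "mean_offset": mean(predicted) - mean(train_labels),
    }


# Hypothetical affinity values (e.g., pKd-like units), not data from the paper.
toy_train = [4.0, 5.0, 6.0, 7.0, 8.0]
toy_predicted = [5.9, 6.0, 6.1, 6.0, 6.0]

toy_summary = prediction_spread(toy_predicted, toy_train)
print(toy_summary)
```

A model whose `std_ratio` is far below 1 with a `mean_offset` close to 0 assigns nearly the same affinity to every compound, which leaves little signal for ranking a screening deck.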