New computational approaches for virtual screening applications are constantly being developed. However, before a particular tool is used to search for new active compounds, its effectiveness in that type of task must be examined. In this study, we conducted a detailed analysis of various aspects of preparing data sets for such an evaluation. We propose a protocol for fetching data from the ChEMBL database, examine various compound representations in terms of the possible bias resulting from the way they are generated, and define a new metric for comparing the structural similarity of compounds, which is in line with chemical intuition. The newly developed metric is also used to evaluate various approaches to dividing a data set into training and test sets, which are themselves examined in detail as a possible source of bias in the results. Finally, machine learning methods are applied in cross-validation studies of the data sets constructed within the paper, which constitute benchmarks for the assessment of computational methods developed for virtual screening tasks. Additionally, analogous data sets for class A G protein-coupled receptors (the 100 targets with the highest number of records) were prepared. They are available at http://gmum.net/benchmarks/, together with a script enabling reproduction of all results, which is available at https://github.com/lesniak43/ananas.
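To illustrate why the way a data set is divided can bias an evaluation, the following is a minimal sketch of a similarity-aware split, in which structurally similar compounds are kept on the same side of the train/test boundary. It is not the protocol or the metric proposed in the paper: it assumes the standard Tanimoto coefficient in place of the new metric, a simple greedy leader clustering, and hypothetical fingerprints given as sets of "on" bit indices. All function names and the toy data are illustrative.

```python
import random


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def leader_clusters(fps, threshold):
    """Greedy leader clustering: a compound joins the first cluster whose
    leader (first member) is at least `threshold` similar to it;
    otherwise it starts a new cluster."""
    clusters = []
    for idx, fp in enumerate(fps):
        for cluster in clusters:
            if tanimoto(fp, fps[cluster[0]]) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def cluster_split(fps, test_fraction=0.25, threshold=0.5, seed=0):
    """Assign whole clusters to the test set until it reaches the requested
    size, so that no cluster is divided between training and test sets."""
    clusters = leader_clusters(fps, threshold)
    random.Random(seed).shuffle(clusters)
    target = test_fraction * len(fps)
    train, test = [], []
    for cluster in clusters:
        (test if len(test) < target else train).extend(cluster)
    return sorted(train), sorted(test)


def max_cross_similarity(fps, train, test):
    """Highest similarity between any training and any test compound."""
    return max(tanimoto(fps[i], fps[j]) for i in train for j in test)


# Hypothetical fingerprints (assumption): four groups of close analogues.
fps = [
    frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({1, 2, 3, 4, 5}),
    frozenset({10, 11, 12}), frozenset({10, 11, 13}),
    frozenset({20, 21, 22, 23}), frozenset({20, 21, 22, 24}),
    frozenset({30, 31}),
]

train, test = cluster_split(fps, test_fraction=0.25, threshold=0.5, seed=0)
print("train:", train, "test:", test)
print("max train/test similarity:", max_cross_similarity(fps, train, test))
```

With a purely random split, close analogues of a test compound can end up in the training set, which tends to inflate measured performance. Comparing `max_cross_similarity` for a random split and for a cluster-based split is one simple way to see that effect. Note that leader clustering only bounds similarity to cluster leaders, so on real data some cross-set pairs may still exceed the threshold.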
Three-dimensional descriptors are often used to search for new biologically active compounds, in both ligand- and structure-based approaches, because they capture the spatial orientation of molecules. They frequently constitute an input for machine learning-based predictions of compound activity or quantitative structure–activity relationship modeling; however, the distribution of their values and the accuracy with which they depict compound orientations might affect the power of the resulting predictive models. In this study, we analyzed the distribution of three-dimensional descriptors calculated for docking poses of active and inactive compounds for all aminergic G protein-coupled receptors with available crystal structures, focusing on the variation in conformations across receptors and crystals. We demonstrated that the consistency of compound orientation in the binding site is largely uncorrelated with affinity itself and is influenced more by other factors, such as the number of rotatable bonds and the crystal structure used for docking. Visualizations of the descriptor distributions were prepared and made available online at http://chem.gmum.net/vischem_stability, which enables the investigation of the chemical structures corresponding to particular data points depicted in the figures. Moreover, the analysis can assist in choosing a crystal structure for docking studies by helping to select the conditions that best discriminate between active and inactive compounds in machine learning-based experiments.
Electronic supplementary material
The online version of this article (10.1007/s11030-018-9894-4) contains supplementary material, which is available to authorized users.