On the order of hundreds of absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) models have been described in the literature in the past decade, yet more often than not they are inaccessible to anyone but their authors. Public accessibility is also an issue with computational models for bioactivity, and the ability to share such models remains a major challenge limiting drug discovery. We describe the creation of a reference implementation of a Bayesian model-building software module, which we have released as an open source component that is now included in the Chemistry Development Kit (CDK) project, as well as implemented in the CDD Vault and in several mobile apps. We use this implementation to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity, and other physicochemical properties. We show that these models possess cross-validation receiver operating characteristic (ROC) values comparable to those reported in prior publications using alternative tools. We have now described how the implementation of Bayesian models with FCFP6 descriptors generated in the CDD Vault enables the rapid production of robust machine learning models from public data or the user’s own datasets. The current study sets the stage for generating models in proprietary software (such as CDD) and exporting these models in a format that could be run in open source software using CDK components. This work also demonstrates that we can enable biocomputation across distributed private or public datasets to enhance drug discovery.
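As a rough illustration of the ROC values used to compare such models, the sketch below computes the area under the ROC curve for one held-out fold from model scores and binary labels. It uses the standard pairwise (Mann-Whitney) formulation; the function name `roc_auc` and the toy data are assumptions for illustration and are not taken from the CDK or CDD Vault implementations.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (active, inactive) pairs in which the active
    compound receives the higher score; ties count as half a win.
    """
    actives = [s for s, y in zip(scores, labels) if y]
    inactives = [s for s, y in zip(scores, labels) if not y]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = sum(
        (a > i) + 0.5 * (a == i)
        for a in actives
        for i in inactives
    )
    return wins / (len(actives) * len(inactives))


# Hypothetical scores for four held-out compounds (1 = active, 0 = inactive).
fold_scores = [0.9, 0.8, 0.7, 0.3]
fold_labels = [1, 0, 1, 0]
print(roc_auc(fold_scores, fold_labels))
```

In a cross-validation setting this value would be computed on each held-out fold and then averaged; the quadratic pairwise loop is adequate for small folds but a rank-based version would be preferable for large datasets.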
The search for molecules with activity against Mycobacterium tuberculosis (Mtb) is employing many approaches in parallel, including high throughput screening and computational methods. We have developed a database (CDD TB) to capture public and private Mtb data while enabling data mining and collaborations with other researchers. We have used the public data along with several cheminformatics approaches to produce models that describe active and inactive compounds. We have compared these datasets with those for known FDA-approved drugs, and have compared Mtb active compounds with inactive compounds. The distributions of polar surface area and pKa of active compounds were found to be statistically significant determinants of activity against Mtb. Hydrophobicity was not always statistically significant. Bayesian classification models for 220,463 molecules were generated and tested with external molecules, and enabled the discrimination of active or inactive substructures from other datasets in the CDD TB. Computational pharmacophores based on known Mtb drugs were able to map to and retrieve a small subset of some of the Mtb datasets, including a high percentage of Mtb actives. The combination of the database, dataset analysis, Bayesian and pharmacophore models provides new insights into molecular properties and features that are determinants of activity in whole cells. This study provides novel insights into the key 1D molecular descriptors, 2D chemical substructures and 3D pharmacophores which can be used to mine the chemistry space, prioritizing those molecules with a higher probability of activity against Mtb.
There is an urgent need for new drugs against tuberculosis, which annually claims 1.7-1.8 million lives. One approach to identify potential leads is to screen small molecules in vitro against Mycobacterium tuberculosis (Mtb). Until recently there was no central repository to collect information on compounds screened. Consequently, it has been difficult to analyze molecular properties of compounds that inhibit the growth of Mtb in vitro. We have collected data from publicly available sources on over 300 000 small molecules deposited in the Collaborative Drug Discovery TB Database. A cheminformatics analysis on these compounds indicates that inhibitors of the growth of Mtb have statistically higher mean logP and rule of 5 alerts, while also having lower HBD count, atom count and PSA (ChemAxon descriptors), compared to compounds that are classed as inactive. Additionally, Bayesian models for selecting Mtb active compounds were evaluated with over 100 000 compounds, and they demonstrated 10-fold enrichment over random for the top ranked 600 compounds. This represents a promising approach for finding compounds active against Mtb in whole cells screened under the same in vitro conditions. Sets of Mtb hit molecules were also examined using various filtering rules used widely in the pharmaceutical industry to identify compounds with potentially reactive moieties. We found differences between the number of compounds flagged by these rules in Mtb datasets, malaria hits, FDA-approved drugs and antibiotics. Combining these approaches may enable selection of compounds with increased probability of inhibition of whole cell Mtb activity.
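The "enrichment over random" figure quoted above is conventionally the hit rate among the top-ranked compounds divided by the hit rate in the whole set. The following is a minimal sketch of that calculation; the function name `enrichment_factor` and the toy numbers are illustrative assumptions and do not reproduce the study's data.

```python
def enrichment_factor(scores, labels, top_n):
    """Hit rate in the top_n ranked compounds relative to the overall hit rate."""
    if top_n <= 0 or not labels or sum(labels) == 0:
        raise ValueError("need top_n > 0 and at least one active")
    # Rank compounds from highest to lowest model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:top_n]
    hit_rate_top = sum(label for _, label in top) / len(top)
    hit_rate_all = sum(labels) / len(labels)
    return hit_rate_top / hit_rate_all


# Hypothetical screen: 10 compounds, 2 actives, both ranked at the top.
toy_scores = [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(enrichment_factor(toy_scores, toy_labels, top_n=2))
```

An enrichment of 1 corresponds to random selection, so a value of 10 for the top 600 compounds would mean that subset contains ten times as many actives as a random draw of the same size.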
Bayesian models constructed from structure-derived fingerprints have been a popular and useful method for drug discovery research when applied to bioactivity measurements that can be effectively classified as active or inactive. The results can be used to rank candidate structures according to their probability of activity, and this ranking benefits from the high degree of interpretability when structure-based fingerprints are used, making the results chemically intuitive. Besides selecting an activity threshold, building a Bayesian model is fast and requires few or no parameters or user intervention. The method also does not suffer from such acute overtraining problems as quantitative structure–activity relationships or quantitative structure–property relationships (QSAR/QSPR). This makes it an approach highly suitable for automated workflows that are independent of user expertise or prior knowledge of the training data. We now describe a new method for creating a composite group of Bayesian models to extend the method to work with multiple states, rather than just binary. Incoming activities are divided into bins, each covering a mutually exclusive range of activities. For each of these bins, a Bayesian model is created to model whether or not the compound belongs in the bin. Analyzing putative molecules using the composite model involves making a prediction for each bin and examining the relative likelihood for each assignment, for example, highest value wins. The method has been evaluated on a collection of hundreds of data sets extracted from ChEMBL v20 and validated data sets for ADME/Tox and bioactivity.
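The composite scheme described above (bin the activities, train one binary model per bin, and let the highest value win) can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: fingerprints are represented as plain sets of integer feature identifiers, each per-bin model uses the commonly cited Laplacian-corrected naive Bayesian estimator, and raw scores are compared directly across bins, whereas a production implementation would likely calibrate them first. All function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def train_bayes(fingerprints, labels):
    """Per-feature contributions log((A + 1) / (T * P + 1)), where A is the
    number of in-class molecules with the feature, T the number of molecules
    with the feature, and P the overall in-class fraction (assumed estimator)."""
    base_rate = sum(labels) / len(labels)
    seen = defaultdict(int)
    seen_in_class = defaultdict(int)
    for fp, in_class in zip(fingerprints, labels):
        for feature in fp:
            seen[feature] += 1
            if in_class:
                seen_in_class[feature] += 1
    return {
        f: math.log((seen_in_class[f] + 1) / (seen[f] * base_rate + 1))
        for f in seen
    }


def score(model, fp):
    """Sum the contributions of the features present; unseen features add 0."""
    return sum(model.get(feature, 0.0) for feature in fp)


def assign_bin(activity, edges):
    """Index of the mutually exclusive bin; edges are ascending cut points."""
    for index, edge in enumerate(edges):
        if activity < edge:
            return index
    return len(edges)


def train_composite(fingerprints, activities, edges):
    """One binary model per bin: does the compound belong in this bin or not?"""
    bins = [assign_bin(a, edges) for a in activities]
    return [
        train_bayes(fingerprints, [b == k for b in bins])
        for k in range(len(edges) + 1)
    ]


def predict_bin(models, fp):
    """Score the compound against every bin; the highest value wins."""
    scores = [score(m, fp) for m in models]
    return max(range(len(scores)), key=scores.__getitem__), scores


# Hypothetical training set: feature-id sets and activities (e.g. pIC50).
toy_fps = [{1, 2}, {1, 3}, {4, 5}, {4, 6}, {7, 8}, {7, 9}]
toy_activities = [4.0, 4.5, 6.0, 6.5, 8.0, 8.5]
toy_models = train_composite(toy_fps, toy_activities, edges=[5.0, 7.0])
print(predict_bin(toy_models, {4, 5})[0])
```

With cut points at 5.0 and 7.0 the toy set yields three bins (below 5, from 5 up to 7, and 7 or above), and each query compound is assigned to whichever bin's model scores it highest.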
We are now seeing the benefit of investments made over the last decade in high-throughput screening (HTS), which are resulting in large structure-activity datasets entering public and open databases such as ChEMBL and PubChem. The growth of academic HTS centers and the increasing move to academia for early stage drug discovery suggest a great need for informatics tools and methods to mine such data and learn from it. Collaborative Drug Discovery, Inc. (CDD) has developed a number of tools for storing, mining, securely and selectively sharing, as well as learning from such HTS data. We present a new web-based data mining and visualization module directly within the CDD Vault platform for high-throughput drug discovery data that makes use of a novel technology stack following modern reactive design principles. We also describe CDD Models within the CDD Vault platform, which enables researchers to share models, share predictions from models, and create models from distributed, heterogeneous data. Our system is built on top of the Collaborative Drug Discovery Vault Activity and Registration data repository ecosystem, which allows users to manipulate and visualize thousands of molecules in real time. This can be performed in any browser on any platform. In this chapter we present examples of its use with public datasets in CDD Vault. Such approaches can complement other cheminformatics tools, whether open source or commercial, in providing approaches for data mining and modeling of HTS data.