Following studies in the late 1990s indicating that poor pharmacokinetics and toxicity were important causes of costly late-stage failures in drug development, it has become widely appreciated that these areas should be considered as early as possible in the drug discovery process. However, in recent years, combinatorial chemistry and high-throughput screening have significantly increased the number of compounds for which early data on absorption, distribution, metabolism, excretion (ADME) and toxicity (T) are needed, which has in turn driven the development of a variety of medium- and high-throughput in vitro ADMET screens. Here, we describe how in silico approaches will further increase our ability to predict and model the most relevant pharmacokinetic, metabolic and toxicity endpoints, thereby accelerating the drug discovery process.
In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that generate the most accurate predictions without being overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions that are, on average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.
ABSTRACT: Ligand-based computational models could be more readily shared between researchers and organizations if they were generated with open source molecular descriptors [e.g., chemistry development kit (
A critical review of very recent work in the field of in silico ADME prediction is presented, with emphasis on work published during the period 2000-2002; several other review articles are mentioned in order to offer a broader view of the field. We find that not much progress has been made in developing robust and predictive models, and that the lack of accurate data, together with the use of questionable modeling end-points, has greatly hindered real progress in defining generally applicable models. Due to the largely empirical nature of QSAR/QSPR approaches, general and truly predictive models for complex phenomena, such as absorption and clearance, may still be chimeric. The development of local models for use within focused chemical series may be the most appropriate way of utilizing in silico ADME predictions, once experience and data have been gained on a given project and/or structural class.
Computational models of cytochrome P450 3A4 inhibition were developed based on high-throughput screening data for 4470 proprietary compounds. Multiple models differentiating inhibitors (IC50 < 3 µM) and noninhibitors were generated using various machine-learning algorithms (recursive partitioning [RP], Bayesian classifier, logistic regression, k-nearest-neighbor, and support vector machine [SVM]) with structural fingerprints and topological indices. Nineteen models were evaluated by internal 10-fold cross-validation and also by an independent test set. The three most predictive models, Barnard Chemical Information (BCI)-fingerprint/SVM, MDL-keyset/SVM, and topological indices/RP, correctly classified 249, 248, and 236 of the 291 noninhibitors and 135, 137, and 147 of the 179 inhibitors in the validation set. Their overall accuracies were 82%, 82%, and 81%, respectively. An investigation of the applicability of the BCI/SVM model found a strong correlation between predictive performance and structural similarity to the training set. Using the Tanimoto similarity index as a confidence measure for the predictions, the limit of extrapolation was 0.7 in the case of the BCI/SVM model. Taking the consensus of the three best models yielded a further improvement in predictive capability (kappa = 0.65 and accuracy = 83%). The consensus model could also be tuned to minimize either false positives or false negatives, depending on the emphasis of the screening.
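The quantities named in this abstract (overall accuracy, kappa, Tanimoto similarity as a confidence cut-off, and a consensus of three models) can be illustrated with a short sketch. This is not the authors' code. The function names are hypothetical, fingerprints are assumed to be represented as sets of on-bit indices, the 0.7 cut-off is assumed to apply to the most similar training compound, and the consensus is assumed to be a simple majority vote. Only the validation-set counts for the BCI-fingerprint/SVM model come from the abstract.

```python
from typing import Iterable, Sequence, Set, Tuple


def accuracy_and_kappa(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    n = tp + fn + tn + fp
    observed = (tp + tn) / n
    # Agreement expected by chance, from predicted and actual class totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return observed, (observed - expected) / (1 - expected)


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def in_domain(query: Set[int], training: Iterable[Set[int]],
              threshold: float = 0.7) -> bool:
    """Assumed reading of the 0.7 limit: the nearest training compound
    must be at least `threshold` similar to the query."""
    nearest = max((tanimoto(query, t) for t in training), default=0.0)
    return nearest >= threshold


def majority_vote(votes: Sequence[bool]) -> bool:
    """Assumed consensus rule: call 'inhibitor' if most models say so."""
    return sum(votes) * 2 > len(votes)


# BCI-fingerprint/SVM validation counts from the abstract:
# 135 of 179 inhibitors and 249 of 291 noninhibitors classified correctly.
acc, kappa = accuracy_and_kappa(tp=135, fn=179 - 135, tn=249, fp=291 - 249)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

With these counts the accuracy is 384/470, which rounds to the 82% reported for that model. Requiring more or fewer positive votes in `majority_vote` is one plausible way to trade false positives against false negatives, although the abstract does not say how the consensus was tuned.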