Background: Determination of acute toxicity, expressed as the median lethal dose (LD50), is one of the most important steps in the drug discovery pipeline. Because in vivo assays for oral acute toxicity in mammals are time-consuming and costly, there is an urgent need to develop in silico prediction models of oral acute toxicity. Results: In this study, based on a comprehensive data set containing 7314 diverse chemicals with rat oral LD50 values, the relevance vector machine (RVM) technique was employed to build regression models for the prediction of oral acute toxicity in rats, which were compared with those built using six other machine learning approaches: k-nearest-neighbor regression, random forest (RF), support vector machine, local approximate Gaussian process, multilayer perceptron ensemble, and eXtreme gradient boosting. A subset of the original molecular descriptors and structural fingerprints (PubChem or SubFP) was chosen using chi-squared statistics. The prediction capabilities of the individual QSAR models, measured by q²ext for the test set containing 2376 molecules, ranged from 0.572 to 0.659. Conclusion: Considering the overall prediction accuracy for the test set, RVM with the Laplacian kernel and RF are recommended for building in silico models with better predictivity for rat oral acute toxicity. By combining the predictions from individual models, four consensus models were developed, yielding better prediction capabilities for the test set (q²ext = 0.669–0.689). Finally, some essential descriptors and substructures relevant to oral acute toxicity were identified and analyzed; they may serve as property or substructure alerts to avoid toxicity. We believe that the best consensus model with high prediction accuracy can be used as a reliable virtual screening tool to filter out compounds with high rat oral acute toxicity. Graphical abstract: Workflow of combinatorial QSAR modelling to predict rat oral acute toxicity.
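The two quantities this abstract relies on, the external q² and a consensus prediction, can be illustrated with a minimal sketch. The abstract does not give the formula used for q²ext, so the definition below (1 minus the ratio of the squared prediction error to the squared deviation of the test values from the training-set mean) is an assumption, as is the unweighted averaging in `consensus`. All numbers are hypothetical toy values, not data from the study.

```python
from statistics import mean

def q2_ext(y_true, y_pred, y_train_mean):
    """External q^2 under an assumed definition:
    1 - sum((y - y_hat)^2) / sum((y - mean(y_train))^2)."""
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss = sum((t - y_train_mean) ** 2 for t in y_true)
    return 1.0 - press / ss

def consensus(predictions):
    """Unweighted mean of the per-compound predictions of several models."""
    return [mean(values) for values in zip(*predictions)]

# Hypothetical test-set values and predictions from two hypothetical models.
y_test = [2.0, 3.0, 4.0, 5.0]
train_mean = 3.5
model_a = [2.4, 2.8, 4.6, 4.6]
model_b = [1.6, 3.4, 3.8, 5.2]

combined = consensus([model_a, model_b])
for name, pred in [("model_a", model_a), ("model_b", model_b), ("consensus", combined)]:
    print(name, round(q2_ext(y_test, pred, train_mean), 3))
```

In this toy case the errors of the two models partly cancel, so the averaged prediction scores higher than either model alone. That is the general motivation for consensus modelling, although averaging is not guaranteed to help on every data set.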
Adverse effects induced by drug–drug interactions may result in early termination of drug development or even withdrawal of drugs from the market, and many drug–drug interactions are caused by the inhibition of cytochrome P450 (CYP450) enzymes. Therefore, the accurate prediction of the inhibition capability of a given compound against a specific CYP450 isoform is highly desirable. In this study, three ensemble learning methods (random forest, gradient boosting decision tree, and eXtreme gradient boosting (XGBoost)) and two deep learning methods (deep neural networks and convolutional neural networks) were used to develop classification models to discriminate inhibitors from noninhibitors for five major CYP450 isoforms (1A2, 2C9, 2C19, 2D6, and 3A4). The results demonstrate that the ensemble learning models generally give better predictions than the deep learning models for the external test sets. Among all of the models, the XGBoost models achieve the best classification capability for the test sets (average prediction accuracy of 90.4%), outperforming even the previously reported multitask deep autoencoder neural network model (88.5%). The Shapley additive explanation method was then used to interpret the models and analyze the misclassified molecules. The important molecular descriptors given by our models are consistent with the structural preferences for inhibitors of different CYP450 isoforms, which may provide valuable clues for detecting potential drug–drug interactions in the early stages of drug discovery.
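The idea behind Shapley additive explanations can be illustrated with an exact Shapley value calculation on a toy model. This is a sketch only: the study's actual tooling is not described in the abstract, and practical SHAP implementations use much faster algorithms than the brute-force enumeration below. The function `toy_model`, its three features, and the all-zero baseline are hypothetical.

```python
from itertools import permutations

def shapley_values(model, x, baseline):
    """Exact Shapley values: the marginal contribution of each feature,
    averaged over every order in which features can be switched from
    their baseline value to their actual value."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = model(current)
        for i in order:
            current[i] = x[i]            # reveal feature i
            updated = model(current)
            phi[i] += updated - previous
            previous = updated
    return [p / len(orders) for p in phi]

def toy_model(f):
    # Hypothetical scoring function with an interaction between features 0 and 2.
    return 2.0 * f[0] + 1.0 * f[1] + 3.0 * f[0] * f[2]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the difference between the model output at `x` and at the baseline, and the interaction term is split evenly between the two features involved. The cost grows factorially with the number of features, so this enumeration is only practical for very small examples.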
Molecular docking provides a computationally efficient way to predict the atomic structural details of protein-RNA interactions (PRI), but accurate prediction of the three-dimensional structures and binding affinities for PRI is still notoriously difficult, partly due to the unreliability of the existing scoring functions for PRI. MM/PBSA and MM/GBSA are more theoretically rigorous than most scoring functions for protein-RNA docking, but their prediction performance for protein-RNA systems remains unclear. Here, we systematically evaluated the capability of MM/PBSA and MM/GBSA to predict the binding affinities and recognize the near-native binding structures for protein-RNA systems with different solvent models and interior dielectric constants (εin). For predicting the binding affinities, the predictions given by MM/GBSA based on the minimized structures in explicit solvent and the GB model with εin = 2 yielded the highest correlation with the experimental data. Moreover, the MM/GBSA calculations based on the minimized structures in implicit solvent and the GB model distinguished the near-native binding structures within the top 10 decoys for 117 out of the 148 protein-RNA systems (79.1%). This performance is better than that of all docking scoring functions studied here. Therefore, MM/GBSA rescoring is an efficient way to improve the prediction capability of scoring functions for protein-RNA systems.
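The top-10 success rate quoted above can be expressed as a short calculation. The sketch below assumes each system is a list of (score, is_near_native) pairs and that a lower score means a more favorable predicted binding free energy; the data layout, the function name, and the toy decoys are hypothetical and are not taken from the study.

```python
def top_n_success(systems, n=10):
    """Fraction of systems with at least one near-native decoy
    among the n best-scored decoys."""
    hits = 0
    for decoys in systems:
        # Assumption: lower (more negative) score ranks first.
        ranked = sorted(decoys, key=lambda d: d[0])
        if any(near_native for _, near_native in ranked[:n]):
            hits += 1
    return hits / len(systems)

# Two hypothetical systems, evaluated with n=2 to keep the example small.
toy_systems = [
    [(-30.1, False), (-45.2, True), (-12.0, False)],   # near-native ranked 1st
    [(-50.0, False), (-48.0, False), (-20.0, True)],   # near-native ranked 3rd
]
toy_rate = top_n_success(toy_systems, n=2)

# The percentage reported in the abstract follows from its own counts.
reported = round(117 / 148 * 100, 1)
print(toy_rate, reported)
```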
Xenobiotic chemicals and their metabolites are mainly excreted from the body in the urine via the urinary tract. Chemical-induced urinary tract toxicity is one of the main causes of failure during drug development, and it is a common adverse event for medications, natural supplements, and environmental chemicals. Despite its importance, there are only a few in silico models for assessing urinary tract toxicity for a large number of compounds with diverse chemical structures. Here, we developed a series of qualitative and quantitative structure-activity relationship (QSAR) models for predicting urinary tract toxicity. In our study, the recursive feature elimination method incorporated with random forests (RFE-RF) was used for dimension reduction, and then eight machine learning approaches were used for QSAR modeling, i.e., relevance vector machine (RVM), support vector machine (SVM), regularized random forest (RRF), C5.0 trees, eXtreme gradient boosting (XGBoost), AdaBoost.M1, SVM boosting (SVMBoost), and RVM boosting (RVMBoost). For building classification models, the synthetic minority oversampling technique was used to handle the imbalanced data set problem. Among all the machine learning approaches, SVMBoost based on the RBF kernel achieves both the best quantitative (q = 0.845) and qualitative predictions for the test set (MCC of 0.787, AUC of 0.893, sensitivity of 89.6%, specificity of 94.1%, and global accuracy of 90.8%). The application domains were then analyzed, and all of the tested chemicals fall within the application domain coverage. We also examined the structural features of the chemicals with large prediction errors. In brief, both the regression and classification models developed by the SVMBoost approach have reliable prediction capability for assessing chemical-induced urinary tract toxicity.
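The classification metrics listed above (MCC, sensitivity, specificity, and global accuracy) all derive from the four confusion-matrix counts. The sketch below uses the standard textbook definitions; the counts are hypothetical and do not reproduce the study's test set, and the function name is illustrative only.

```python
from math import sqrt

def classification_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts: 50 toxic and 50 non-toxic compounds.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print({name: round(value, 3) for name, value in metrics.items()})
```

MCC is often preferred over accuracy for imbalanced data sets because it uses all four counts, which is one reason it is reported alongside sensitivity and specificity.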
Although a wide variety of machine learning (ML) algorithms have been utilized to learn quantitative structure–activity relationships (QSARs), there is no agreed single best algorithm for QSAR learning. Therefore, a comprehensive understanding of the performance characteristics of popular ML algorithms used in QSAR learning is highly desirable. In this study, five linear algorithms [linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR) and principal component regression (PCR)], three analogizers [radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN) and radial basis function Gaussian process regression (rbf-GPR)], six symbolists [extreme gradient boosting (XGBoost), Cubist, random forest (RF), multivariate adaptive regression splines (MARS), gradient boosting machine (GBM), and classification and regression tree (CART)] and two connectionists [principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN)] were employed to learn the regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost and DNN generally perform better than the other algorithms. The overall performances of different algorithms can be ranked from the best to the worst as follows: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended for regression learning on small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performances of the ensemble models by integrating the predictions of multiple ML algorithms. The results illustrate that the ensembles of two or three algorithms in different categories can indeed improve the predictions of the best individual ML algorithms.
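One way to explore ensembles of two or three algorithms is to enumerate every combination and score the averaged prediction. The sketch below is an assumption about how such a search could look, not the procedure used in the study: it averages predictions with equal weights, scores by RMSE, and does not restrict combinations to algorithms from different categories. The prediction values and the helper names are hypothetical.

```python
from itertools import combinations
from math import sqrt

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def best_ensemble(y_true, preds, sizes=(2, 3)):
    """Return (rmse, names) of the equal-weight ensemble with the lowest RMSE."""
    best = None
    for k in sizes:
        for names in combinations(sorted(preds), k):
            averaged = [sum(vals) / k for vals in zip(*(preds[n] for n in names))]
            score = rmse(y_true, averaged)
            if best is None or score < best[0]:
                best = (score, names)
    return best

# Hypothetical validation-set values and per-algorithm predictions.
y_valid = [1.0, 2.0, 3.0]
predictions = {
    "rbf-SVM": [1.2, 2.2, 3.2],   # slightly too high
    "XGBoost": [0.8, 1.8, 2.8],   # slightly too low
    "DNN":     [1.5, 2.5, 3.5],
}
score, members = best_ensemble(y_valid, predictions)
print(members, round(score, 4))
```

Selecting the combination on a validation set rather than on the final test set avoids an optimistic bias in the reported ensemble performance.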