Carcinogenicity refers to a highly toxic end point of certain chemicals, and has become an important issue in the drug development process. In this study, three novel ensemble classification models, namely Ensemble SVM, Ensemble RF, and Ensemble XGBoost, were developed to predict carcinogenicity of chemicals using seven types of molecular fingerprints and three machine learning methods based on a dataset containing 1003 diverse compounds with rat carcinogenicity. Among these three models, Ensemble XGBoost is found to be the best, giving an average accuracy of 70.1 ± 2.9%, sensitivity of 67.0 ± 5.0%, and specificity of 73.1 ± 4.4% in five-fold cross-validation and an accuracy of 70.0%, sensitivity of 65.2%, and specificity of 76.5% in external validation. In comparison with some recent methods, the ensemble models outperform some machine learning-based approaches and yield equal accuracy and higher specificity but lower sensitivity than rule-based expert systems. It is also found that the ensemble models could be further improved if more data were available. As an application, the ensemble models are employed to discover potential carcinogens in the DrugBank database. The results indicate that the proposed models are helpful in predicting the carcinogenicity of chemicals. A web server called CarcinoPred-EL has been built for these models (http://ccsipb.lnu.edu.cn/toxicity/CarcinoPred-EL/).
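The reported accuracy, sensitivity, and specificity follow directly from confusion-matrix counts, and an ensemble prediction can be formed by combining per-fingerprint model outputs. The sketch below is a minimal illustration only: the abstract does not state how the individual outputs are combined, so the probability averaging, the 0.5 threshold, and the function names `ensemble_predict` and `classification_metrics` are all assumptions, and the numbers in the usage lines are toy values, not data from the study.

```python
from statistics import mean


def ensemble_predict(prob_lists, threshold=0.5):
    """Average per-model probabilities for each compound, then threshold.

    prob_lists: one list of predicted probabilities per base model
    (for example, one model per fingerprint type). Averaging is an
    assumption; the abstract does not specify the combination rule.
    """
    n_compounds = len(prob_lists[0])
    averaged = [mean(model[i] for model in prob_lists) for i in range(n_compounds)]
    return [1 if p >= threshold else 0 for p in averaged]


def classification_metrics(y_true, y_pred):
    """Return accuracy, sensitivity, and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # fraction of true positives recovered
        "specificity": tn / (tn + fp),  # fraction of true negatives recovered
    }


# Toy usage: three hypothetical base models scoring four compounds.
toy_probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.7, 0.4, 0.3, 0.1],
    [0.8, 0.1, 0.5, 0.6],
]
toy_labels = [1, 0, 1, 0]
toy_pred = ensemble_predict(toy_probs)
toy_metrics = classification_metrics(toy_labels, toy_pred)
```

Sensitivity and specificity are reported separately because a single accuracy figure can hide a model that favors one class, which is the trade-off the abstract describes when comparing against rule-based expert systems.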
Drug-induced liver injury (DILI) is a major safety concern in the drug-development process, and various methods have been proposed to predict the hepatotoxicity of compounds during the early stages of drug trials. In this study, we developed an ensemble model using 3 machine learning algorithms and 12 molecular fingerprints from a dataset containing 1241 diverse compounds. The ensemble model achieved an average accuracy of 71.1 ± 2.6%, sensitivity (SE) of 79.9 ± 3.6%, specificity (SP) of 60.3 ± 4.8%, and area under the receiver-operating characteristic curve (AUC) of 0.764 ± 0.026 in 5-fold cross-validation and an accuracy of 84.3%, SE of 86.9%, SP of 75.4%, and AUC of 0.904 in an external validation dataset of 286 compounds collected from the Liver Toxicity Knowledge Base. Compared with previous methods, the ensemble model achieved relatively high accuracy and SE. We also identified several substructures related to DILI. In addition, we provide a web server offering access to our models (http://ccsipb.lnu.edu.cn/toxicity/HepatoPred-EL/).
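The AUC quoted above can be read as the probability that a randomly chosen positive compound receives a higher score than a randomly chosen negative one. The following sketch computes it with that pairwise definition; the function name `roc_auc` and the toy scores are illustrative assumptions and are not taken from the study.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Ties between a positive and a negative score count as half a win.
    This is quadratic in the number of compounds, which is fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy usage: two positives and two negatives with hypothetical scores.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Unlike accuracy, SE, and SP, this measure does not depend on a decision threshold, which is why it is commonly reported alongside them.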
The prediction of compound cytotoxicity is an important part of the drug discovery process. However, predictive performance is often poor because the underlying high-throughput screening datasets suffer from class imbalance. In this study, several strategies for performing a structure-activity relationship study on a cytotoxic endpoint in the AID364 dataset were explored to address the class-imbalance problem. Random forest adaboost was used as the base learner for each of 10 types of molecular fingerprints, and an ensemble method and six data-balancing methods were applied to balance the classes. The ensemble model using the MACCS fingerprint was found to be the best, giving an area under the curve of 85.2% ± 0.35%, sensitivity of 81.8% ± 0.65%, and specificity of 76.0% ± 0.12% in fivefold cross-validation and an area under the curve of 78.8%, sensitivity of 55.5%, and specificity of 78.5% in external validation. Good performance was also obtained on other datasets with different sizes and degrees of imbalance. To explore the structural commonality of cytotoxic compounds, several substructures were identified that can serve as a reference for substructure alerts. The results indicate that the proposed models are helpful in predicting the cytotoxicity of chemicals.
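One common way to handle class imbalance is random undersampling of the majority class. The abstract does not name its six data-balancing methods, so the sketch below is only an illustration of the general idea under that assumption; the function name `undersample`, the fixed seed, and the toy data are hypothetical.

```python
import random


def undersample(X, y, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size.

    X: list of feature vectors (any objects); y: list of 0/1 labels.
    Returns a shuffled, balanced (X, y) pair.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    # Order the two index lists so the shorter one is the minority class.
    minority, majority = sorted([pos_idx, neg_idx], key=len)
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy usage: 2 actives and 8 inactives reduced to a 2 + 2 training set.
toy_X = list(range(10))
toy_y = [1, 1] + [0] * 8
bal_X, bal_y = undersample(toy_X, toy_y)
```

Repeating this with different seeds and training one base learner per balanced subset is one way an ensemble can use all of the majority-class data without letting it dominate any single model.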
Protein phosphorylation is involved in most cellular functions. Because of its importance, many methods have been developed to identify phosphorylation sites. Experimental methods for identifying phosphorylation sites are both costly and time consuming, so computational methods are highly desirable. In this paper, three new encoding methods are developed: BinCTF (Binary-conjoint triad feature), CTF2 (new conjoint triad feature), and BinCTF2 (Binary-new conjoint triad feature), which are modifications of the Binary and CTF encodings. An ensemble support vector machine is then applied to predict phosphorylation sites at serine (S), threonine (T), and tyrosine (Y) residues. The numerical results indicate that, on some performance measures, the new methods outperform previous methods.
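For orientation, the standard conjoint triad feature (CTF) that these encodings modify can be sketched as follows. The sketch assumes the commonly used grouping of the 20 amino acids into seven classes and counts every run of three consecutive residues by class, giving a 7 × 7 × 7 = 343-dimensional vector. It returns raw counts without normalization, and it does not reproduce BinCTF, CTF2, or BinCTF2, whose definitions are not given in the abstract. The names `CLASSES` and `conjoint_triad` are illustrative.

```python
from itertools import product

# Seven amino-acid classes commonly used for CTF (an assumption here;
# the abstract does not list the grouping it uses).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}

# Fixed position of each (class, class, class) triad in the output vector.
TRIAD_INDEX = {triad: i for i, triad in enumerate(product(range(7), repeat=3))}


def conjoint_triad(sequence):
    """Return a 343-element list of raw triad counts for a protein sequence.

    Triads containing a residue outside the 20 standard amino acids are skipped.
    """
    counts = [0] * len(TRIAD_INDEX)
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    for triad in zip(classes, classes[1:], classes[2:]):
        if None in triad:
            continue
        counts[TRIAD_INDEX[triad]] += 1
    return counts


# Toy usage on a hypothetical four-residue fragment.
toy_vec = conjoint_triad("AGVC")
```

In a site-prediction setting, a vector like this would typically be computed for a fixed-length window centred on each candidate S, T, or Y residue and passed to the classifier; the window length is another detail the abstract leaves unspecified.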