Ubiquitination-site prediction is an important task because ubiquitination is a critical regulatory function for many biological processes, such as proteasome degradation, DNA repair and transcription, signal transduction, endocytosis, and sorting. However, the highly dynamic and reversible nature of ubiquitination makes it difficult to identify specific ubiquitination sites experimentally. In this paper, we explore the possibility of improving the prediction of ubiquitination sites using ensemble machine learning methods, including Random Forest (RF), Adaptive Boosting (ADB), Gradient Boosting (GB), and eXtreme Gradient Boosting (XGB). By running grid search with the four ensemble methods and six non-ensemble comparison methods, namely Naive Bayes (NB), Logistic Regression (LR), Decision Trees (DT), Support Vector Machine (SVM), LASSO, and K-Nearest Neighbor (KNN), we find that all four ensemble methods significantly outperform at least one of the non-ensemble methods included in this study. XGB outperforms three of the six non-ensemble methods; ADB and RF each outperform two; GB outperforms one. Comparing the four ensemble methods among themselves, GB performs the worst. XGB and ADB are very comparable in terms of prediction, but ADB beats XGB by far in both unit model training time and total running time. Both XGB and ADB tend to do better than RF in terms of prediction, but RF has the shortest unit model training time of the three. In addition, we notice that ADB tends to outperform XGB on small-scale datasets, and RF can outperform either ADB or XGB when data are less balanced. Interestingly, we find that SVM, LR, and LASSO, three of the six non-ensemble methods included, perform comparably with all the ensemble methods.
Based on this study, ensemble learning is a promising approach to significantly improving ubiquitination-site prediction using protein segment data.
Background: Ubiquitination plays an important role in protein post-translational processes and has been found to be involved in a number of regulatory functions, including proteasome degradation, DNA repair, transcription, signal transduction, endocytosis, and sorting. Because identifying ubiquitination sites is critical to furthering our understanding of the mechanism of ubiquitination, various experimental and machine learning methods have been used for this task. Improving the accuracy of ubiquitination-site prediction remains important but challenging. In this research, we explore the possibility of improving the prediction performance of machine learning by incorporating grid search in the training process.

Method: We developed grid search procedures for each of six widely used machine learning (ML) methods, NB, LR, DT, SVM, LASSO, and KNN, and applied them to ubiquitination-site prediction using the six PCP datasets that were previously developed. For each ML method, we defined a set of values for each tunable hyperparameter available to the method. These sets of values are then combined to form a large grid of hyperparameter settings, and each setting is used in the grid search. We integrated 5-fold cross-validation in grid search to train and test ML models, and applied an additional independent validation procedure by conducting a pre-training 80-20 sample split. We evaluated the performance of the six methods by comparing them side by side on each of the six datasets. We also compared the grid search results with previously published results obtained without grid search. To optimize the prediction performance, we trained 1.1 million ML models in total through grid search.

Results: We compared the overall prediction performance of these six methods, as well as their prediction performance on balanced vs. imbalanced data and on large-scale vs. small-scale data. From the perspective of dataset, we find that the overall performance on every PCP dataset has been significantly improved in this study compared to the previous study, with the percentage increase of the average AUC across the six datasets ranging from 7.9% (PCP-4) up to 17.0% (PCP-1). From the perspective of method, we find that three out of four methods significantly benefit from grid search compared to their previously published non-grid-search results, with maximum AUC improvements as high as 47% (LASSO on PCP-5), 43.3% (NB on PCP-1), and 33.7% (SVM on PCP-6). SVM ranks first overall, followed by KNN in second place, based on their average AUCs on all datasets. However, these two also rank first and second in the total running time that grid search requires (KNN 76 days and SVM 15 days). We also find that SVM, KNN, and DT tend to handle small-scale and imbalanced datasets better, while LR and LASSO do well with large-scale and balanced datasets. NB is more sensitive to data imbalance and less sensitive to dataset size.

Conclusions: Our results show that grid search has significantly improved the performance of ubiquitination-site prediction. We find that the performance of a method is closely related to its hyperparameter setting and the type of data it handles. Even though SVM is on average the top performer, no single method provides the best performance for all datasets. When sufficient computing resources are available, grid search is an effective way to identify both a top-performing model for a machine learning method and a suitable machine learning method for a particular dataset.
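
The grid construction and validation scheme described under Method can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the hyperparameter names and values, the helper names (`build_grid`, `holdout_split`, `k_fold`, `grid_search`), and the toy scoring function are all hypothetical. A real run would train a model on each training fold and report its AUC on the validation fold.

```python
import itertools
import random


def build_grid(param_values):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = sorted(param_values)
    combos = itertools.product(*(param_values[n] for n in names))
    return [dict(zip(names, combo)) for combo in combos]


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, held-out) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def grid_search(param_values, n_samples, score_fn, k=5):
    """Pick the setting with the best mean cross-validated score.

    The held-out indices are returned untouched so the chosen setting can be
    checked once on data that played no part in the search.
    """
    train_idx, holdout_idx = holdout_split(n_samples)
    best_setting, best_score = None, float("-inf")
    for setting in build_grid(param_values):
        scores = [score_fn(setting, tr, va) for tr, va in k_fold(train_idx, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_setting, best_score = setting, mean_score
    return best_setting, best_score, holdout_idx


# Hypothetical SVM-style grid: 3 values of C times 2 kernels gives 6 settings.
svm_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}


def toy_score(setting, train_idx, val_idx):
    """Stand-in for 'train on train_idx, return AUC on val_idx'."""
    return -abs(setting["C"] - 1.0)


best, score, holdout = grid_search(svm_grid, n_samples=100, score_fn=toy_score)
```

Because every setting is evaluated on every fold, the number of trained models is the grid size times the number of folds, which is how a search over six methods and six datasets can reach model counts in the millions.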