With advances in computing, many machine learning algorithms have been applied to biology, giving rise to the discipline of bioinformatics. Protein function prediction is a classic research topic in this field. Although many researchers have made progress in identifying proteins with different algorithms, they often extract a large number of feature types and use very complex classification methods to gain only a small improvement in classification performance, and the process is very time-consuming. In this study, we attempt to classify vesicular transport proteins using as few features as possible while still obtaining a comparatively satisfactory result. We adopt CTDC, the composition submethod of the composition, transition, and distribution (CTD) method, to extract only 39 features from each sequence, and use LibSVM as the classifier. We use the SMOTE method to address dataset imbalance. Our dataset contains 11619 protein sequences. We selected 4428 sequences to train the classification model and another 1832 sequences from the dataset to test it, achieving an accuracy of 71.77%. After dimension reduction with MRMD, the accuracy is 72.16%.
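As a rough illustration of the composition step, the sketch below computes the fraction of residues that fall into each of three groups for a single physicochemical property. The function name `ctdc_composition` and the hydrophobicity grouping are assumptions made for illustration and are not taken from the abstract. One common formulation of CTDC uses 13 properties with three groups each, which would yield the 39 values mentioned above, but the abstract does not state which properties or groupings were used.

```python
from collections import Counter

# Assumed three-group partition of the 20 standard amino acids for one
# property (hydrophobicity). This grouping is illustrative only; the
# abstract does not specify the partitions that were actually used.
HYDROPHOBICITY_GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}


def ctdc_composition(sequence, groups):
    """Return the fraction of residues in `sequence` that fall in each group.

    Residues that belong to no group (for example 'X') still count toward
    the sequence length, so the fractions can sum to less than 1.
    """
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    length = len(sequence)
    return {
        name: sum(counts[residue] for residue in members) / length
        for name, members in groups.items()
    }


# Hypothetical toy sequence with two residues from each group.
features = ctdc_composition("RKGACL", HYDROPHOBICITY_GROUPS)
```

Repeating this calculation for each property and concatenating the results gives one fixed-length feature vector per sequence, regardless of sequence length, which is what makes the representation usable as input to a classifier such as LibSVM.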
Enzymes, a crucial group of biocatalysts produced by living cells, make the chemical reactions in organisms more efficient. According to the properties of the reactions that enzymes catalyze, the Enzyme Commission (EC) number system divided enzymes into six main classes in 1961: oxidoreductases (EC1), transferases (EC2), hydrolases (EC3), lyases (EC4), isomerases (EC5), and ligases (EC6). These six classes remained unchanged for many years until a new class, the translocases (EC7), was added in August 2018. Different enzymes catalyze reactions with different properties, and predicting enzyme classes is an important research topic, because knowing the class of an enzyme allows us to further study its structure and function. Because the number of enzymes whose function remains unknown is enormous, determining enzyme characteristics through biological experiments is time-consuming. Devising computational models to predict enzyme classes has therefore become a feasible alternative. In the hope of giving researchers more inspiration and ideas for predicting the EC numbers of enzymes by machine learning, we summarize a variety of methods used in the prediction of enzyme families.
INDEX TERMS Commission, enzyme classification, machine learning, bioinformatics.