Multi-label proteins can participate in carrier transportation, enzyme catalysis, hormone regulation and other life activities. Meanwhile, they play a key role in the fields of biopharmaceuticals, gene and cell therapy. This article proposes a prediction method called Mps-mvRBRL to predict the subcellular localization (SCL) of multi-label protein. Firstly, pseudo position-specific scoring matrix, dipeptide composition, position specific scoring matrix-transition probability composition, gene ontology and pseudo amino acid composition algorithms are used to obtain numerical information from different views. Based on the contribution of five individual feature extraction methods, differential evolution is used for the first time to learn the weight of single feature, and then these original features use a weighted combination method to fuse multi-view information. Secondly, the fused high-dimensional features use a weighted linear discriminant analysis framework based on binary weight form to eliminate irrelevant information. Finally, the best feature vector is input into the joint ranking support vector machine and binary relevance with robust low-rank learning classifier to predict the SCL. After applying leave-one-out cross-validation, the overall actual accuracy (OAA) and overall location accuracy (OLA) of Mps-mvRBRL on the training set of Gram-positive bacteria are both 99.81%. The OAA on the test sets of plant, virus and Gram-negative bacteria datasets are 97.24%, 98.55% and 98.20%, respectively, and the OLA are 97.16%, 97.62% and 98.28%, respectively. The results show that the model achieves good prediction performance for predicting the SCL of multi-label protein.
Motivation
Multi-label protein subcellular localization (SCL) is an indispensable way to study protein function. It can locate a certain protein (such as the human transmembrane protein that promotes the invasion of the SARS-CoV-2) or expression product at a specific location in a cell, which can provide a reference for clinical treatment of diseases such as COVID-19.
Results
The paper proposes a novel method named ML-locMLFE. First of all, six feature extraction methods are adopted to obtain protein effective information. These methods include pseudo amino acid composition (PseAAC), encoding based on grouped weight (EBGW), gene ontology (GO), multi-scale continuous and discontinuous (MCD), residue probing transformation (RPT) and evolutionary distance transformation (EDT). In the next part, we utilize the multi-label information latent semantic index (MLSI) method to avoid the interference of redundant information. In the end, multi-label learning with feature induced labeling information enrichment (MLFE) is adopted to predict the multi-label protein SCL. The Gram-positive bacteria dataset is chosen as a training set, while the Gram-negative bacteria dataset, virus dataset, newPlant dataset and SARS-CoV-2 dataset as the test sets. The overall actual accuracy (OAA) of the first four datasets is 99.23%, 93.82%, 93.24%, and 96.72% by the leave-one-out cross validation (LOOCV). It is worth mentioning that the OAA prediction result of our predictor on the SARS-CoV-2 dataset is 72.73%. The results indicate that the ML-locMLFE method has obvious advantages in predicting the SCL of multi-label protein, which provides new ideas for further research on the SCL of multi-label protein.
Availability and implementation
The source codes and data are publicly available at https://github.com/QUST-AIBBDRC/ML-locMLFE/.
Supplementary information
Supplementary data are available at Bioinformatics online.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.