Motivation
Identifying multi-label protein subcellular localization (SCL) is an indispensable way to study protein function. It places a given protein (such as the human transmembrane protein that promotes SARS-CoV-2 invasion) or expression product at a specific location in a cell, which can provide a reference for the clinical treatment of diseases such as COVID-19.
Results
We propose a novel method named ML-locMLFE. First, six feature extraction methods are used to obtain effective protein information: pseudo amino acid composition (PseAAC), encoding based on grouped weight (EBGW), gene ontology (GO), multi-scale continuous and discontinuous (MCD), residue probing transformation (RPT) and evolutionary distance transformation (EDT). Next, the multi-label information latent semantic index (MLSI) method is used to remove the interference of redundant information. Finally, multi-label learning with feature induced labeling information enrichment (MLFE) is used to predict the multi-label protein SCL. The Gram-positive bacteria dataset is chosen as the training set, while the Gram-negative bacteria, virus, newPlant and SARS-CoV-2 datasets serve as test sets. Under leave-one-out cross validation (LOOCV), the overall actual accuracy (OAA) on the first four datasets is 99.23%, 93.82%, 93.24% and 96.72%, respectively. On the SARS-CoV-2 dataset, the OAA of our predictor is 72.73%. These results indicate that ML-locMLFE has clear advantages in predicting the SCL of multi-label proteins and provides new ideas for further research in this area.
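The evaluation protocol described above can be illustrated with a minimal sketch. It assumes that OAA is the exact-match ratio, as it is commonly defined in the multi-label SCL literature: a protein counts as correct only if its predicted location set equals the true set. The function names and the nearest-neighbour classifier are hypothetical stand-ins for illustration; they are not the MLFE predictor or code from the ML-locMLFE repository.

```python
import numpy as np

def overall_actual_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label set matches the true set exactly.

    Assumption: OAA is taken to be the exact-match ratio.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    exact = np.all(y_true == y_pred, axis=1)  # one flag per protein
    return float(exact.mean())

def nearest_neighbour(X_train, Y_train, x):
    """Hypothetical stand-in classifier (NOT MLFE): copy the label set
    of the closest training sample in Euclidean distance."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return Y_train[np.argmin(distances)]

def loocv_oaa(X, Y, fit_predict):
    """Leave-one-out cross validation: hold out each sample once,
    predict it from all remaining samples, then score with OAA."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=bool)
    n = X.shape[0]
    preds = np.zeros_like(Y)
    for i in range(n):
        train = np.arange(n) != i  # boolean mask excluding sample i
        preds[i] = fit_predict(X[train], Y[train], X[i])
    return overall_actual_accuracy(Y, preds)

# Toy data: four "proteins", one feature, two possible locations.
X_toy = [[0.0], [0.1], [5.0], [5.1]]
Y_toy = [[1, 0], [1, 0], [0, 1], [0, 1]]
oaa_toy = loocv_oaa(X_toy, Y_toy, nearest_neighbour)
```

Because OAA requires every location of a protein to be predicted with no extra locations, it is a stricter measure than per-label accuracy, which is worth keeping in mind when reading the percentages above.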
Availability and implementation
The source code and data are publicly available at https://github.com/QUST-AIBBDRC/ML-locMLFE/.
Supplementary information
Supplementary data are available at Bioinformatics online.