Prediction of Golgi-Resident Protein Types Using Computational Method

Lin, Hao; Ding, Hui; Chen, Wei

doi:10.2174/9781608058624114010011

Cited by 1 publication

(2 citation statements)

References 0 publications


“…Support vector machine (SVM) (Ding et al, 2011, 2013; Feng et al, 2013; Lin et al, 2014; Jiao and Du, 2016a,b; Zeng et al, 2017; Rahman et al, 2018; Chen et al, 2019; Dao et al, 2019; Liu B. et al, 2019), K-nearest neighbor (KNN) (Ahmad et al, 2017; Ahmad and Hayat, 2019), and random forests (RF) (Yang R. et al, 2016; Pan et al, 2017; Ru et al, 2019; Su et al, 2019; Zheng et al, 2019) classifiers have been used to identify sub-Golgi proteins and for other fields. In this study, RF was selected for modeling because it is a powerful machine-learning tool and facilitates analysis of feature importance.…”

Section: Methods (mentioning)

confidence: 99%

“…In the past few years, several protein subcellular locations and protein type prediction tools, including sub-Golgi protein identification tools (Teasdale and Yuan, 2002; Van Dijk et al, 2008; Chou et al, 2010; Ding et al, 2011, 2013; Jiao et al, 2014; Lin et al, 2014; Nikolovski et al, 2014; Jiao and Du, 2016a,b; Yang R. et al, 2016; Ahmad et al, 2017; Wang et al, 2017; Rahman et al, 2018; Ahmad and Hayat, 2019; Wuritu et al, 2019), have been developed using various machine learning algorithms, including increment diversity Mahalanobis discriminant (IDMD) (Ding et al, 2011), support vector machine (SVM) (Ding et al, 2013, 2017; Jiao et al, 2014; Lin et al, 2014; Jiao and Du, 2016a,b), random forest (RF) (Ding et al, 2016a,b; Yang R. et al, 2016; Yu et al, 2017; Liu et al, 2018), and K nearest neighbor algorithm (KNN) (Ahmad et al, 2017; Ahmad and Hayat, 2019), among others. To generate feature vectors for sub-Golgi protein identification, protein amino acid composition (AAC) (Rahman et al, 2018), k-gapped dipeptide composition (k-gapDC) (Ding et al, 2011, 2013), pseudo amino acid composition (PseAAC) (Jiao et al, 2014; Liu et al, 2015), and protein sequences evolutionary information (e.g., position-specific scoring matrix, PSSM) and their derivative features (Yang et al, 2014; Jiao and Du, 2016a,b; Yang R. et al, 2016; Ahmad et al, 2017; Rahman et al, 2018) have been used.…”

Section: Introduction (mentioning)

confidence: 99%
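
The k-gapped dipeptide composition named in the statement above can be sketched as follows. This is a minimal illustration under a commonly used definition (the frequency of each residue pair separated by `gap` intervening positions); the function name, the handling of non-standard residues, and the normalization by the number of valid pairs are assumptions, not details taken from the cited papers.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gapped_dipeptide_composition(sequence, gap):
    """Return the 400 gapped-dipeptide frequencies for one sequence.

    A pair is formed by residue i and residue i + gap + 1, so gap=0
    gives ordinary adjacent dipeptides. Pairs containing a
    non-standard residue are skipped (an assumption of this sketch).
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    span = gap + 1
    total = 0
    for i in range(len(sequence) - span):
        pair = sequence[i] + sequence[i + span]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short for this gap: return an all-zero vector.
        return {p: 0.0 for p in pairs}
    return {p: c / total for p, c in counts.items()}


# Toy sequence, for illustration only.
features = gapped_dipeptide_composition("ACDACDACD", gap=2)
```

Each sequence maps to a fixed-length vector of 400 values regardless of its length, which is what allows classifiers such as SVM or RF to be trained on it directly.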


A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features

Jin, Ding, et al. 2019. Front. Bioeng. Biotechnol.

104


To gain insight into the malfunction of the Golgi apparatus and its relationship to various genetic and neurodegenerative diseases, the identification of sub-Golgi proteins, both cis-Golgi and trans-Golgi proteins, is of great significance. In this study, a state-of-the-art random forests sub-Golgi protein classifier, rfGPT, was developed. The rfGPT used 2-gap dipeptide and split amino acid composition for the feature vectors and was combined with the synthetic minority over-sampling technique (SMOTE) and an analysis of variance (ANOVA) feature selection method. The rfGPT was trained on a sub-Golgi protein sequence data set (137 sequences) with sequence identity less than 25%. For the optimal rfGPT classifier with 93 features, the accuracy (ACC) was 90.5%; the Matthews correlation coefficient (MCC) was 0.811; the sensitivity (Sn) was 92.6%; and the specificity (Sp) was 88.4%. The independent testing scores for the rfGPT were ACC = 90.6%; MCC = 0.696; Sn = 96.1%; and Sp = 69.2%. Although the independent testing accuracy was 4.4% lower than that for the best reported sub-Golgi classifier trained on a data set with 40% sequence identity (304 sequences), the rfGPT is currently the top sub-Golgi protein predictor utilizing feature vectors without any position-specific scoring matrix or its derivative features. Therefore, the rfGPT is a more practical tool, because it requires no sequence alignment against tens of millions of protein sequences. To date, the rfGPT is the Golgi classifier with the best independent testing scores among those optimized by training on smaller benchmark data sets. Feature importance analysis indicates that the non-polar and aliphatic residues composition, the (aromatic residues) + (non-polar, aliphatic residues) dipeptide, and the aromatic residues composition between the NH2-terminal and COOH-terminal of protein sequences are the three top biological features for distinguishing the sub-Golgi proteins.
