2019
DOI: 10.3389/fpls.2018.01961

Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods

Abstract: Motivation: Pentatricopeptide repeat (PPR) is a triangular pentapeptide repeat domain that plays a vital role in plant growth. In this study, we seek to identify PPR coding genes and proteins using a mixture of feature extraction methods. We use four single feature extraction methods focusing on the sequence, physical, and chemical properties as well as the amino acid composition, and mix the features. The Max-Relevant-Max-Distance (MRMD) technique is applied to reduce the feature dimension. Classification use…

Cited by 13 publications (17 citation statements)
References 80 publications

“…To quantitatively identify proteins, the physicochemical characteristics were obtained using a method (temporarily called 188d), which could extract sequence information and amino acid properties (Song et al, 2014; Xu et al, 2014, 2018; Fu et al, 2019; Liu, 2019; Zhu et al, 2019). The first 20 elements in the results of this method denoted the frequency of the 20 original amino acids (Zhu et al, 2019); the next 24 features reflected the group proportion corresponding to three groups (Qu et al, 2019); the following 120 dimensions were the distributions of three groups in five local positions (Cai et al, 2003); the last 24 features were the numbers of three types of dipeptides.…”
Section: Physicochemical Characteristics (mentioning)
confidence: 99%
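
The layout described in the statement above adds up to 188 dimensions (20 + 24 + 120 + 24). As a minimal sketch, the first block (amino acid frequencies) could be computed as shown here. The names `BLOCK_SIZES` and `amino_acid_composition` are hypothetical, and skipping non-standard residue codes is an assumption, not something the quoted text specifies. The remaining three blocks depend on property groupings that the statement does not list, so they are not attempted.

```python
from collections import Counter

# The 20 standard amino acids, in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Block sizes of the 188-dimensional vector as described in the quoted statement.
# The key names are hypothetical labels chosen for this sketch.
BLOCK_SIZES = {
    "composition": 20,          # frequency of each amino acid
    "group_proportion": 24,     # proportion of three groups
    "group_distribution": 120,  # three groups at five local positions
    "group_dipeptides": 24,     # counts of three types of dipeptides
}


def amino_acid_composition(sequence):
    """Return the 20 amino acid frequencies of a protein sequence.

    Assumption: characters outside the 20 standard residues (e.g. 'X')
    are ignored rather than counted.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    print(sum(BLOCK_SIZES.values()))
    print(amino_acid_composition("AACD")[:4])
```

The frequencies sum to 1 for any sequence that contains at least one standard residue, which gives a quick sanity check on the first block.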
“…where i is a residue, L denotes the length of the whole protein sequence, S_{i,j} represents the i-th property of the j-th amino acid, and S_i reflects the mean value of the i-th property (Qu et al, 2019). In our experiment, the value of lg was set to 2.…”
Section: Acc (mentioning)
confidence: 99%
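
The equation that these symbols belong to is not included in the quoted statement. The sketch below therefore uses a commonly used auto-covariance form, averaging the product of mean-centred property values at positions j and j + lag over the L - lag available pairs. That form, the normalisation by L - lag, and the names `auto_covariance` and `ac_features` are assumptions made for illustration and may differ from the citing paper's definition. Only the maximum lag of 2 comes from the text.

```python
def auto_covariance(values, lag):
    """Auto-covariance of one property along a sequence at a given lag.

    `values` holds one property value per residue (S_{i,j} for a fixed
    property i). Assumed form: mean over j of
    (values[j] - mean) * (values[j + lag] - mean).
    """
    length = len(values)
    if not 0 < lag < length:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    mean = sum(values) / length
    total = sum(
        (values[j] - mean) * (values[j + lag] - mean)
        for j in range(length - lag)
    )
    return total / (length - lag)


def ac_features(values, max_lag=2):
    """One auto-covariance value per lag from 1 to max_lag (2 in the text)."""
    return [auto_covariance(values, lag) for lag in range(1, max_lag + 1)]


if __name__ == "__main__":
    # Toy property profile, not real physicochemical data.
    print(ac_features([1.0, 2.0, 3.0, 4.0]))
```

With several properties, calling `ac_features` once per property and concatenating the results would give a fixed-length vector regardless of sequence length, which is the usual reason for using this kind of encoding.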
“…Pentatricopeptide repeat (PPR), which is a 35-amino acid sequence motif (Chen et al, 2018; Rojas et al, 2018) and is commonly found in eukaryotes and terrestrial plants (Ruida et al, 2013), plays an important role in plant growth and development (Qu et al, 2019). PPR proteins, which are distinguished by the presence of tandem degenerate PPR motifs and by the relative lack of introns in the genes coding for them, are regarded as an ideal model to study plant cytoplasmic and nuclear interactions (Wang et al, 2008).…”
Section: Introduction (mentioning)
confidence: 99%
“…Feature extraction from protein sequences plays an important role in protein classification [1,2,3,4] in many areas, such as identification of plant pentatricopeptide repeat coding protein [5], prediction of bacterial type IV secreted effectors [6,7], identification of heat shock protein [8], prediction of mitochondrial proteins [9], etc. In general, prevailing encoding approaches of protein sequences for feature extraction include pseudo-amino acid composition (PseAAC) [8,9,10,11,12,13,14,15,16,17,18,19,20], position-specific scoring matrix (PSSM) [7,21,22,23,24,25,26,27,28,29,30], position-specific iterated BLAST (PSI-BLAST) [31,32,33,34,35], etc.…”
Section: Introduction (mentioning)
confidence: 99%
“…In other words, the encoding approach corresponding to the most accurate classification result should be considered. Prevailing classifiers including random forest or decision tree classifier (RF or DTC) [1,36], gradient boosting machine (GBM) [37,38], k-nearest-neighbor (kNN) [39,40], linear discriminant analysis (LDA) [41,42], logistic regression (LR) [43], multi-layer perceptron (MLP) [44,45], naive bayesian (NB) [5,46], support vector machine (SVM) [47,48] are credible.…”
Section: Introduction (mentioning)
confidence: 99%