Identifying Plant Pentatricopeptide Repeat Coding Gene/Protein Using Mixed Feature Extraction Methods

Qu, Kaiyang; Wei, Leyi; Yu, Jinxin; Wang, Chunyu

doi:10.3389/fpls.2018.01961

Cited by 13 publications

(17 citation statements)

References 80 publications

“…To quantitatively identify proteins, the physicochemical characteristics were obtained using a method (temporarily called 188d), which could extract sequence information and amino acid properties (Song et al, 2014; Xu et al, 2014, 2018; Fu et al, 2019; Liu, 2019; Zhu et al, 2019). The first 20 elements in the results of this method denoted the frequency of the 20 original amino acids (Zhu et al, 2019); the next 24 features reflected the group proportion corresponding to three groups (Qu et al, 2019); the following 120 dimensions were the distributions of three groups in five local positions (Cai et al, 2003); the last 24 features were the numbers of three types of dipeptides.…”

Section: Physicochemical Characteristics (mentioning)

confidence: 99%
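The 188-dimension layout described in the excerpt above (20 + 24 + 120 + 24) can be illustrated with a minimal sketch. It assumes, as is common for this kind of encoding but not stated in the excerpt, that 8 physicochemical properties each contribute 3 group proportions, 15 distribution values (3 groups at 5 positions), and 3 mixed-group dipeptide frequencies. The function names and the single three-way grouping below are hypothetical and shown for one property only; this is not the cited authors' implementation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Illustrative three-way grouping for ONE property (an assumption, not taken
# from the excerpt); a full encoder would repeat this for each property.
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
FRACTIONS = (0.0, 0.25, 0.5, 0.75, 1.0)  # five local positions per group


def composition_20(seq):
    """Frequency of each of the 20 standard amino acids (first 20 features)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def property_features(seq):
    """3 group proportions + 15 distribution values + 3 dipeptide frequencies.

    Assumes seq contains only the 20 standard amino acid letters.
    """
    lookup = {aa: g for g, members in GROUPS.items() for aa in members}
    enc = "".join(lookup[aa] for aa in seq)
    n = len(enc)

    # Proportion of residues falling in each group.
    proportions = [enc.count(g) / n for g in "123"]

    # Relative position of the first, 25%, 50%, 75%, and last residue of each group.
    distribution = []
    for g in "123":
        positions = [i + 1 for i, c in enumerate(enc) if c == g]
        for f in FRACTIONS:
            if not positions:
                distribution.append(0.0)
            else:
                k = max(1, math.ceil(f * len(positions)))
                distribution.append(positions[k - 1] / n)

    # Frequency of adjacent residue pairs that mix two different groups.
    dipeptides = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        hits = sum(1 for x, y in zip(enc, enc[1:]) if {x, y} == {a, b})
        dipeptides.append(hits / (n - 1) if n > 1 else 0.0)

    return proportions + distribution + dipeptides
```

Under the stated assumption, one property yields 21 values, so eight properties plus the 20 amino acid frequencies give 20 + 8 × 21 = 188 features, which matches the 20 + 24 + 120 + 24 breakdown quoted above.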

A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features

Feng, Dan, et al. 2020. Front. Bioeng. Biotechnol.

The thermostability of proteins is a key factor considered during enzyme engineering, and finding a method that can identify thermophilic and non-thermophilic proteins will be helpful for enzyme design. In this study, we established a novel method combining mixed features and machine learning to achieve this recognition task. In this method, an amino acid reduction scheme was adopted to recode the amino acid sequence. Then, the physicochemical characteristics, auto-cross covariance (ACC), and reduced dipeptides were calculated and integrated to form a mixed feature set, which was processed using correlation analysis, feature selection, and principal component analysis (PCA) to remove redundant information. Finally, four machine learning methods and a dataset containing 500 random observations out of 915 thermophilic proteins and 500 random samples out of 793 non-thermophilic proteins were used to train and predict the data. The experimental results showed that 98.2% of thermophilic and non-thermophilic proteins were correctly identified using 10-fold cross-validation. Moreover, our analysis of the final reserved features and removed features yielded information about the crucial, unimportant, and insensitive elements, and it also provided essential information for enzyme design.

“…where i is a residue, L denotes the length of the whole protein sequence, S_i,j represents the i-th property of the j-th amino acid, and S_i reflects the mean value of the i-th property (Qu et al, 2019). In our experiment, the value of lg was set to 2.…”

Section: ACC (mentioning)

confidence: 99%
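The equation that the excerpt above refers to is elided in the quote, so the following is only a sketch of the auto-cross covariance (ACC) form commonly used for protein sequences, assumed here rather than taken from the cited paper: for lag lg, the mean-centred value of a property at position j is multiplied by the mean-centred value at position j + lg and averaged over the L - lg available positions. The function names and the example property values are hypothetical.

```python
def cross_covariance(values_a, values_b, lag):
    """Covariance between property a at position j and property b at j + lag.

    values_a and values_b hold one property value per residue of the same
    sequence, so both lists must have length L, and lag must be below L.
    """
    length = len(values_a)
    if len(values_b) != length:
        raise ValueError("both property tracks must cover the same sequence")
    if not 0 < lag < length:
        raise ValueError("lag must satisfy 0 < lag < sequence length")

    mean_a = sum(values_a) / length
    mean_b = sum(values_b) / length
    total = sum(
        (values_a[j] - mean_a) * (values_b[j + lag] - mean_b)
        for j in range(length - lag)
    )
    return total / (length - lag)


def auto_covariance(values, lag):
    """Special case of cross_covariance where both tracks are the same property."""
    return cross_covariance(values, values, lag)


# Hypothetical per-residue property values for a four-residue sequence.
example = [1.0, 2.0, 3.0, 4.0]
ac_lag2 = auto_covariance(example, 2)  # lg = 2, the setting reported in the excerpt
```

With lg fixed at 2, each property contributes one auto-covariance value and each ordered pair of distinct properties contributes one cross-covariance value, so the feature count depends only on how many properties are used.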

A Method for Prediction of Thermophilic Protein Based on Reduced Amino Acids and Mixed Features

Feng, Dan, et al. 2020. Front. Bioeng. Biotechnol.

“…Pentatricopeptide repeat (PPR), which is a 35-amino acid sequence motif (Chen et al, 2018; Rojas et al, 2018) and is commonly found in eukaryotes and terrestrial plants (Ruida et al, 2013), plays an important role in plant growth and development (Qu et al, 2019). PPR proteins, which are distinguished by the presence of tandem degenerate PPR motifs and by the relative lack of introns in the genes coding for them, are regarded as an ideal model to study plant cytoplasmic and nuclear interactions (Wang et al, 2008).…”

Section: Introduction (mentioning)

confidence: 99%

Identifying Plant Pentatricopeptide Repeat Proteins Using a Variable Selection Method

Zhao, Wang, et al. 2021. Front. Plant Sci.

Motivation: Pentatricopeptide repeat (PPR), which is a triangular pentapeptide repeat domain, plays an important role in plant growth. Features extracted from sequences are applicable to PPR protein identification using certain classification methods. However, which components of a multidimensional feature (namely variables) are more effective for protein discrimination has never been discussed. Therefore, we seek to select variables from a multidimensional feature for identifying PPR proteins.

Method: A framework of variable selection for identifying PPR proteins is proposed. Samples representing PPR positive proteins and negative ones are equally split into a training and a testing set. Variable importance is regarded as scores derived from an iteration of resampling, training, and scoring step on the training set. A model selection method based on Gaussian mixture model is applied to automatic choice of variables which are effective to identify PPR proteins. Measurements are used on the testing set to show the effectiveness of the selected variables.

Results: Certain variables other than the multidimensional feature they belong to do work for discrimination between PPR positive proteins and those negative ones. In addition, the content of methionine may play an important role in predicting PPR proteins.

“…Feature extraction from protein sequences plays an important role in protein classification [1,2,3,4] of many areas, such as identification of plant pentatricopeptide repeat coding protein [5], prediction of bacterial type IV secreted effectors [6,7], identification of heat shock protein [8], prediction of mitochondrial proteins [9], etc. In general, prevailing encoding approaches of protein sequences for feature extraction include pseudo-amino acid composition (PseAAC) [8,9,10,11,12,13,14,15,16,17,18,19,20], position-specific scoring matrix (PSSM) [7,21,22,23,24,25,26,27,28,29,30], position-specific iterated blast (PSI-BLAST) [31,32,33,34,35] etc.…”

Section: Introduction (mentioning)

confidence: 99%

“…In other words, the encoding approach corresponding to the most accurate classification result should be considered. Prevailing classifiers including random forest or decision tree classifier (RF or DTC) [1,36], gradient boosting machine (GBM) [37,38], k-nearest-neighbor (kNN) [39,40], linear discriminant analysis (LDA) [41,42], logistic regression (LR) [43], multi-layer perceptron (MLP) [44,45], naive bayesian (NB) [5,46], support vector machine (SVM) [47,48] are credible.…”

Section: Introduction (mentioning)

confidence: 99%

Variable Selection from a Feature Representing Protein Sequences: A Case of Classification on Bacterial Type IV Secreted Effectors

Zhang et al. 2020. Preprint

Background: Classification of certain proteins with specific functions is momentous for biological research. Encoding approaches of protein sequences for feature extraction play an important role in protein classification. Many computational methods (namely classifiers) are used for classification on protein sequences according to various encoding approaches. Commonly, protein sequences keep certain labels corresponding to different categories of biological functions (e.g., bacterial type IV secreted effectors or not), which makes protein prediction a fantasy. As to protein prediction, a kernel set of protein sequences keeping certain labels certified by biological experiments should be existent in advance. However, it has been hardly ever seen in prevailing researches. Therefore, unsupervised learning rather than supervised learning (e.g. classification) should be considered. As to protein classification, various classifiers may help to evaluate the effectiveness of different encoding approaches. Besides, variable selection from an encoded feature representing protein sequences is an important issue that also needs to be considered.

Results: Focusing on the latter problem, we propose a new method for variable selection from an encoded feature representing protein sequences. Taking a benchmark dataset as a case, experiments are made to identify bacterial type IV secreted effectors from protein sequences, which indicates the effectiveness of our method.

Conclusions: Certain variables other than an encoded feature they belong to do work for discrimination between different types of proteins. In addition, ensemble classifiers with an automatic assignment of different base classifiers do achieve a better classification result.
