We develop an approach for feature elimination in statistical learning with kernel machines, based on recursive elimination of features. We present theoretical properties of this method and show that it is uniformly consistent in finding the correct feature space under certain generalized assumptions. We present four case studies to show that the assumptions are met in most practical situations, and present simulation results to demonstrate the performance of the proposed approach.

We also define the restricted space $\mathcal{F}_J$ as follows:

Definition 1. Let $J \subseteq \{1, 2, \ldots, d\}$ be a set of indices. Then, for a given functional space $\mathcal{F}$, define $\mathcal{F}_J = \{g : g = f \circ \pi_{J^c}, \; f \in \mathcal{F}\}$, where $\pi_{J^c}$ is the projection map that takes an element $x \in \mathbb{R}^d$ and maps it to $x_J \in \mathbb{R}^d$ by substituting the elements of $x$ indexed by the set $J$ with zero.

Remark 4. Note that we can subsequently define the space $\mathcal{X}_J = \{\pi_{J^c}(x) : x \in \mathcal{X}\}$. The above formulation thus allows us to create lower-dimensional versions of a given functional space $\mathcal{F}$.

We are now ready to state our feature selection method. The risk-RFE algorithm, defined for the parameters $\{\lambda_n, \delta_n\}$, is given as:

Algorithm 1 (risk-RFE). Start with $J \equiv \emptyset$ empty and let $Z \equiv \{1, 2, \ldots, d\}$.

STEP 1: In the $k$th iteration, choose the feature $i_k \in Z \setminus J$ which minimizes the regularized empirical risk $\mathcal{R}_{\mathrm{reg},\lambda_n}$ over the restricted space $\mathcal{F}_{J \cup \{i_k\}}$, and add $i_k$ to $J$.
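To make the elimination loop concrete, the following is a minimal sketch in Python. It assumes kernel ridge regression with a Gaussian kernel as the kernel machine and a simplified stopping rule that halts once removing a further feature raises the minimized regularized risk by more than $\delta_n$; the function names, the kernel choice, and the exact form of the stopping rule are illustrative assumptions, not fixed by the paper.

```python
# A hedged sketch of risk-RFE, assuming kernel ridge regression as the
# kernel machine; helper names and the Gaussian kernel are illustrative.
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def regularized_risk(X, y, lam):
    """Minimized regularized empirical risk of kernel ridge regression:
    min_alpha (1/n)||y - K alpha||^2 + lam * alpha' K alpha,
    whose minimizer solves (K + n*lam*I) alpha = y."""
    n = len(y)
    K = gaussian_kernel(X, X)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    fhat = K @ alpha
    return np.mean((y - fhat) ** 2) + lam * alpha @ K @ alpha

def risk_rfe(X, y, lam, delta):
    """Greedy risk-RFE: J accumulates eliminated features. Each pass
    zeroes out one more candidate column (the projection pi_{J^c} of
    Definition 1) and keeps the candidate whose removal gives the
    smallest regularized risk; stop when the risk degrades by > delta
    (a simplified surrogate for the paper's delta_n stopping rule)."""
    d = X.shape[1]
    J, Z = [], list(range(d))
    base = regularized_risk(X, y, lam)
    while len(J) < d:
        scores = {}
        for i in set(Z) - set(J):
            Xp = X.copy()
            Xp[:, J + [i]] = 0.0          # apply pi_{(J ∪ {i})^c}
            scores[i] = regularized_risk(Xp, y, lam)
        i_k = min(scores, key=scores.get)  # cheapest feature to eliminate
        if scores[i_k] - base > delta:     # removal now costs too much risk
            break
        J.append(i_k)
        base = scores[i_k]
    return J                               # features deemed irrelevant
```

Zeroing the columns indexed by $J \cup \{i\}$ before computing the kernel is exactly the restriction to $\mathcal{F}_{J \cup \{i\}}$ from Definition 1, since the fitted function can then depend only on the surviving coordinates.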