Amino acid encoding schemes for machine learning methods

Zamani, Masood; Kremer, Stefan C.

doi:10.1109/bibmw.2011.6112394

Cited by 15 publications

(8 citation statements)

References 16 publications


“…The same input data were represented differently in order to select among different encoding methods and, therefore, each encoded input had a variable impact on machine learning measures. In computational biology, encoding of amino acids can be achieved by considering amino acids' physicochemical properties, for instance, using the BLOSUM substitution matrix, or by a generic character-wise encoding like one-hot or integer encoding used also in other ML domains (Zamani and Kremer, 2011).…”

Section: Encoding (mentioning)

confidence: 99%

Machine Learning Detects Anti-DENV Signatures in Antibody Repertoire Sequences

Horst, Smakaj, Natali, et al. 2021, Front. Artif. Intell.


Dengue infection is a global threat. As of today, there is no universal dengue fever treatment or vaccine unreservedly recommended by the World Health Organization. The investigation of the specific immune response to dengue virus would support antibody discovery as therapeutics for passive immunization and vaccine design. High-throughput sequencing enables the identification of the multitude of antibodies elicited in response to dengue infection at the sequence level. Artificial intelligence can mine the complex data generated and has the potential to uncover patterns in entire antibody repertoires and detect signatures distinctive of single virus-binding antibodies. However, these machine learning methods have not been harnessed to determine the immune response to dengue virus. In order to enable the application of machine learning, we have benchmarked existing methods for encoding biological and chemical knowledge as inputs and have investigated novel encoding techniques. We have applied different machine learning methods such as neural networks, random forests, and support vector machines and have investigated the parameter space to determine the best-performing algorithms for the detection and prediction of antibody patterns at the repertoire and antibody sequence levels in dengue-infected individuals. Our results show that immune response signatures to dengue are detectable both at the antibody repertoire and at the antibody sequence levels. By combining machine learning with phylogenies and network analysis, we generated novel sequences that present dengue-binding specific signatures. These results might aid further antibody discovery and support vaccine design.
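The generic character-wise encodings named in the citation statement above (integer and one-hot) can be illustrated with a minimal sketch. It is not taken from Zamani and Kremer or from the citing paper: the alphabet ordering, the choice to start integer codes at 1, and the function names `integer_encode` and `one_hot_encode` are all assumptions made for illustration. A BLOSUM-based encoding would replace each one-hot row with the residue's row from the substitution matrix; that matrix is not reproduced here.

```python
# Hypothetical helper names; the alphabet order is an assumption
# (20 standard residues, sorted by one-letter code).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def integer_encode(seq):
    """Map each residue to an integer in 1..20 (0 is left free, e.g. for padding)."""
    # Raises KeyError for any character outside the 20-letter alphabet.
    return [INDEX[aa] + 1 for aa in seq.upper()]


def one_hot_encode(seq):
    """Map each residue to a 20-dimensional vector with a single 1."""
    vectors = []
    for aa in seq.upper():
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        vectors.append(row)
    return vectors


if __name__ == "__main__":
    print(integer_encode("ACDY"))      # [1, 2, 3, 20]
    print(one_hot_encode("AC")[1][:4])  # [0, 1, 0, 0]
```

Integer encoding is compact but imposes an arbitrary ordering on the residues, whereas one-hot encoding treats all residues as equally dissimilar at the cost of 20 values per position.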


“…After feature engineering, deep learning and machine learning algorithms can be applied to the extracted features to perform protein family classification. Yet, in order to apply these methods, sequences are needed to be converted to numerical representations since there is no such a method to perform artificial intelligence with raw protein sequences [7,8].…”

Section: Introduction (mentioning)

confidence: 99%

“…In the literature, there are limited methods for converting protein sequences into the numbers. In general, BLOSUM62 (BLOcks SUbstitution Matrix), PAM25 (Point Accepted Mutation), hydrophobicity, EIIP (Electron-Ion Interaction Potential) are applied and the performance of family classification is highly depending on the conversion method [7,9]. Recently, deep learning models are actively used in bioinformatics studies and show promising results.…”

Section: Introduction (mentioning)

confidence: 99%

A novel Fibonacci hash method for protein family identification by using recurrent neural networks

Alakuş, Türkoğlu 2021, Turk J Elec Eng & Comp Sci


Identifying and classifying protein families is one of the most significant problems in bioinformatics and protein studies. It is essential to specify the family of a protein, since proteins are widely used in smart drug therapies, protein functions and, in some cases, phylogenetic trees. Some sequencing techniques allow researchers to identify the biological similarities of protein families and functions. Yet determining these families with sequencing applications requires a huge amount of time. Thus, a computer- and artificial-intelligence-based classification system is needed to save time and avoid complexity in the protein classification process. In order to designate protein families with computer-aided systems, protein sequences need to be converted to numerical representations. In this paper, we provide a novel protein mapping method based on Fibonacci numbers and a hashing table (FIBHASH). Each amino acid code is assigned a Fibonacci number based on its integer representation. Later, these amino acid codes are inserted into a hashing table of size 20 to be classified with recurrent neural networks. To determine the performance of the proposed mapping method, we used accuracy, F1-score, recall, precision, and AUC as evaluation criteria. In addition, the results are compared with those of other protein mapping techniques, including EIIP, hydrophobicity, CPNR, Atchley factors, BLOSUM62, PAM250, binary one-hot encoding, and randomly encoded representations. The proposed method showed a promising result, with an accuracy of 92.77% and an AUC score of 0.98.
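Among the mapping techniques listed above, a property-based encoding such as hydrophobicity replaces each residue with a single physicochemical value. The sketch below assumes the Kyte-Doolittle hydropathy scale; the source names only "hydrophobicity" and does not say which scale was used. The names `KYTE_DOOLITTLE`, `hydrophobicity_encode`, and `mean_hydropathy` are illustrative, not taken from the cited work.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the source does not specify one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_encode(seq):
    """Replace each residue with its hydropathy value (one float per position)."""
    # Raises KeyError for characters outside the 20 standard residues.
    return [KYTE_DOOLITTLE[aa] for aa in seq.upper()]


def mean_hydropathy(seq):
    """Average hydropathy over the sequence, a simple whole-sequence feature."""
    values = hydrophobicity_encode(seq)
    return sum(values) / len(values)


if __name__ == "__main__":
    print(hydrophobicity_encode("AIR"))  # [1.8, 4.5, -4.5]
```

Unlike one-hot encoding, this produces one number per residue, so chemically similar residues receive nearby values, but distinct residues such as N, D, Q, and E become indistinguishable.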


“…In the past, several machine learning approaches have been developed for the classification of protein sequences into functional or structural existing superfamilies [ 16 , 19 – 22 ]. A superfamily is comprised of a set of proteins that possess sequence or structural homology.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

Iqbal, Faye, Belhaouari, et al. 2014, The Scientific World Journal


Bioinformatics has been an emerging area of research for the last three decades. The ultimate aims of bioinformatics were to store and manage biological data, and to develop and analyze computational tools to enhance their understanding. The size of the data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for experimental methods. To reduce the gap between newly sequenced proteins and proteins with known functions, many computational techniques involving classification and clustering algorithms were proposed in the past. The classification of protein sequences into existing superfamilies is helpful in predicting the structure and function of large numbers of newly discovered proteins. The existing classification results are unsatisfactory due to the huge number of features obtained through various feature encoding methods. In this work, a statistical metric-based feature selection technique has been proposed in order to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance measure metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth.
