Sequence representations and their utility for predicting protein-protein interactions

Kimothi, Dhananjay; Biyani, Pravesh; Hogan, James M.

doi:10.1101/2019.12.31.890699

Cited by 2 publications (2 citation statements)

References 46 publications (60 reference statements)


“…Another embedding method, doc2vec [42], includes the whole context to some extent and performs better than word2vec on selected tasks. Several methods use doc2vec to represent proteins [5, 27, 97-101]. Also, deep language models, such as BERT [91] and ELMO [46], were originally developed for NLP, and later employed for protein representations [23, 28].…”

Section: Different Approaches For Representing Proteins (mentioning)

confidence: 99%

Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis

Unsal, Ataş, Albayrak, et al. 2020. Preprint.


Data-centric approaches have been utilized to develop predictive methods for elucidating uncharacterized aspects of proteins such as their functions, biophysical properties, subcellular locations and interactions. However, studies indicate that the performance of these methods should be further improved to effectively solve complex problems in biomedicine and biotechnology. A data representation method can be defined as an algorithm that calculates numerical feature vectors for samples in a dataset, to be later used in quantitative modelling tasks. Data representation learning methods do this by training and using a model that employs statistical and machine/deep learning algorithms. These novel methods mostly take inspiration from the data-driven language models that have yielded ground-breaking improvements in the field of natural language processing. Lately, these learned data representations have been applied to the field of protein informatics and have displayed highly promising results in terms of extracting complex traits of proteins regarding sequence-structure-function relations. In this study, we conducted a detailed investigation of protein representation learning methods, by first categorizing and explaining each approach, and then conducting benchmark analyses on: (i) inferring semantic similarities between proteins, (ii) predicting ontology-based protein functions, and (iii) classifying drug target protein families. We examine the advantages and disadvantages of each representation approach in light of the benchmark results. Finally, we discuss current challenges and suggest future directions. We believe the conclusions of this study will help researchers in applying machine/deep learning-based representation techniques to protein data for various types of predictive tasks. Furthermore, we hope it will demonstrate the potential of machine learning-based data representations for protein science and inspire the development of novel methods/tools to be utilized in the fields of biomedicine and biotechnology.



“…Asgari et al proposed BioVec based on the skip-gram model for biological sequences representation (Asgari and Mofrad, 2015). Kimothi et al developed a model named seq2vec based on doc2vec, which is an extension of the original word2vec (Kimothi et al, 2016). The dna2vec model is dedicated to representing variable-length words (Ng, 2017a).…”

Section: Introduction (mentioning)

confidence: 99%

i5hmCVec: Identifying 5-Hydroxymethylcytosine Sites of Drosophila RNA Using Sequence Feature Embeddings

Liu 2022. Front. Genet.


5-Hydroxymethylcytosine (5hmC), one of the most important RNA modifications, plays an important role in many biological processes. Accurately identifying RNA modification sites helps understand the function of RNA modification. In this work, we propose a computational method for identifying 5hmC-modified regions using machine learning algorithms. We applied a sequence feature embedding method based on the dna2vec algorithm to represent the RNA sequence. The results showed that the performance of our model is better than that of state-of-the-art methods. All datasets and source code used in this study are available at: https://github.com/liu-h-y/5hmC_model.
