The language of the protein universe

Scaiewicz, Andrea; Levitt, Michael

doi:10.1016/j.gde.2015.08.010

Cited by 32 publications

(23 citation statements)

References 77 publications

(72 reference statements)

“…The protein universe is a concept corresponding to the set of all possible protein sequences that can be found in all living organisms [1]. Understanding how the current protein universe appeared and evolved is a central question in evolutionary molecular biology [2, 3, 4]. Moreover, deciphering the rules behind the emergence of new proteins and their conservation in an organism may lead to improved methods for protein design.…”

mentioning

confidence: 99%

Exploring the dark foldable proteome by considering hydrophobic amino acids topology

Bitard‐Feildel

Callebaut

2017

Sci Rep

The protein universe corresponds to the set of all proteins found in all organisms. A way to explore it is by taking into account the domain content of the proteins. However, some parts of sequences and many entire sequences remain un-annotated despite a converging number of domain families. The un-annotated part of the protein universe is referred to as the dark proteome and remains poorly characterized. In this study, we quantify the amount of foldable domains within the dark proteome by using the hydrophobic cluster analysis methodology. These un-annotated foldable domains were grouped using a combination of remote homology searches and domain annotations, leading to the definition of different levels of darkness. The dark foldable domains were analyzed to understand what makes them different from domains stored in databases and thus difficult to annotate. The un-annotated domains of the dark proteome universe display specific features relative to database domains: shorter length, non-canonical content and particular topology in hydrophobic residues, higher propensity for disorder, and a higher energy. These features make them hard to relate to known families. Based on these observations, we emphasize that domain annotation methodologies can still be improved to fully apprehend and decipher the molecular evolution of the protein universe.

mentioning

confidence: 99%

“…It is well established that domains, i.e., those basic units from which proteins are made, are the essential blocks of protein evolution [4, 9, 10]. Two distinct definitions of a protein domain co-exist regarding the type of molecular data used for their identification.…”

mentioning

confidence: 99%

Exploring the dark foldable proteome by considering hydrophobic amino acids topology

Bitard‐Feildel

Callebaut

2017

Sci Rep

“…For example, (Chou and Cai, 2002) classified the cellular location and (Forslund and Sonnhammer, 2008; Doǧan et al., 2016) predicted the associated Gene Ontology (GO) terms. There exist two ways to represent domains: either by their linear order in a protein, the domain architecture (Scaiewicz and Levitt, 2015), or by a graph where nodes are domains and edges connect domains that co-exist in a protein (Moore et al., 2008; Forslund and Sonnhammer, 2012).…”

Section: Introduction

mentioning

confidence: 99%
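
The two representations named in the statement above can be illustrated with a minimal sketch. The protein names and domain identifiers below are made up for illustration and do not come from any of the cited papers; the helper names `domain_architecture` and `cooccurrence_graph` are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical proteins with made-up domain identifiers,
# listed in order from N-terminus to C-terminus.
proteins = {
    "protA": ["D1", "D2", "D3"],
    "protB": ["D2", "D1"],
    "protC": ["D3"],
}


def domain_architecture(domains):
    """Linear representation: the ordered tuple of domains in one protein."""
    return tuple(domains)


def cooccurrence_graph(proteins):
    """Graph representation: nodes are domains, and an edge links two
    domains that co-exist in at least one protein (order is ignored)."""
    graph = defaultdict(set)
    for domains in proteins.values():
        unique = sorted(set(domains))
        for d in unique:
            graph[d]  # register the node even if it has no neighbours
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return {node: sorted(neighbours) for node, neighbours in graph.items()}


architectures = {name: domain_architecture(d) for name, d in proteins.items()}
graph = cooccurrence_graph(proteins)
```

In this sketch the architecture keeps domain order, so `protA` and `protB` differ even where they share domains, while the graph records only which domains appear together.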

dom2vec: Unsupervised protein domain embeddings capture domains structure and function providing data-driven insights into collocations in domain architectures

Malone

Nejdl

2020

Preprint

Motivation: Word embedding approaches have revolutionized Natural Language Processing (NLP) research. These approaches aim to map words to a low-dimensional vector space in which words with similar linguistic features are close in the vector space. These NLP approaches also preserve local linguistic features, such as analogy. Embedding-based approaches have also been developed for proteins. To date, such approaches treat amino acids as words, and proteins are treated as sentences of amino acids. These approaches have been evaluated either qualitatively, via visual inspection of the embedding space, or extrinsically, via performance on a downstream task. However, it is difficult to directly assess the intrinsic quality of the learned embeddings. Results: In this paper, we introduce dom2vec, an approach for learning protein domain embeddings. We also present four intrinsic evaluation strategies which directly assess the quality of protein domain embeddings. We leverage the hierarchy relationship of InterPro domains, known secondary structure classes, Enzyme Commission class information, and Gene Ontology annotations in these assessments. Importantly, these evaluations allow us to assess the quality of learned embeddings independently of a particular downstream task, similar to the local linguistic features used in traditional NLP. We also show that dom2vec embeddings are competitive with, or even outperform, state-of-the-art approaches on downstream tasks. Availability: The protein domain embeddings vectors and the entire code to reproduce the results will become available. Contact: melidis@l3s.uni-hannover.de

“…A primary way in which proteins evolve is through rearrangement of their functional and structural units, known as protein domains [1,2]. The main way to describe the domain composition of a protein is by the linear order of its domains, the domain architecture [6].…”

Section: Introduction

mentioning

confidence: 99%

dom2vec: Capturing domain structure and function using self-supervision on protein domain architectures

Melidis

Malone

Nejdl

2020

Preprint

Background: Word embedding approaches have revolutionized natural language processing (NLP) research. These approaches aim to map words to a low-dimensional vector space, in which words with similar linguistic features cluster together. Embedding-based methods have also been developed for proteins, where words are amino acids and sentences are proteins. The learned embeddings have been evaluated qualitatively, via visual inspection of the embedding space, and extrinsically, via performance comparison on downstream protein prediction tasks. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid, in order to measure the quality of each unique learned embedding vector. Results: Here, we present dom2vec, an approach for learning protein domain embeddings using word2vec on InterPro annotations. In contrast to sequence embeddings, biological metadata do exist for protein domains, related to each domain separately. Therefore, we present four intrinsic evaluation strategies to quantitatively assess the quality of the learned embedding space. To perform a reliable evaluation in terms of biology knowledge, we selected the metadata related to the most distinctive biological characteristics of domains. These are the structure, enzymatic and molecular function of a given domain. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec to protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction.
Conclusions: We report that the application of word2vec on InterPro annotations produces domain embeddings with two significant advantages over sequence embeddings. First, each unique dom2vec vector can be quantitatively evaluated towards its available structure and function metadata. Second, the produced embeddings can outperform the sequence embeddings for a subset of downstream tasks. Overall, dom2vec embeddings are able to capture the most important biological properties of domains and surpass sequence embeddings for a subset of prediction tasks.
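
To make the "domains as words, architectures as sentences" analogy concrete, the following is a minimal sketch of the generic skip-gram step used by word2vec-style methods: turning each sentence into (center, context) training pairs. It does not reproduce the dom2vec pipeline; the domain identifiers, the `window` parameter value, and the function name `skipgram_pairs` are assumptions made for illustration.

```python
def skipgram_pairs(architecture, window=2):
    """Return (center, context) pairs for one domain architecture,
    pairing each domain with its neighbours within `window` positions."""
    pairs = []
    for i, center in enumerate(architecture):
        lo = max(0, i - window)
        hi = min(len(architecture), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, architecture[j]))
    return pairs


# Hypothetical corpus: each "sentence" is the ordered domain list of one protein.
corpus = [
    ["D1", "D2", "D3"],
    ["D2", "D4"],
]

training_pairs = [
    pair
    for architecture in corpus
    for pair in skipgram_pairs(architecture, window=1)
]
```

Pairs like these are what a skip-gram model would be trained on, so domains that tend to share neighbours in architectures would end up with similar vectors.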
