Evaluating Representations for Gene Ontology Terms

Duong, Dat; Uppunda, Ankith; Ju, Chelsea J.‐T.; Zhang, James; Chen, Muhao; Eskin, Eleazar; Li, Jingyi Jessica; Chang, Kai-Wei

doi:10.1101/765644

Cited by 12 publications

(17 citation statements)

References 27 publications


“…To keep our software GOAT manageable for all users, we implement Transformer with only one head and 12 layers. We initialize the GO embedding E_G as the pre-trained embedding BERT LAYER12 in [6] which transforms the GO definitions into vectors, where GOs having related definitions will have comparable vectors. Using pre-trained GO embeddings reduces the number of parameters in the Transformer, which can also reduce overfitting, use less GPU memory, and decrease run time.…”

Section: Results (mentioning)

confidence: 99%
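
The statement above says that GO terms with related definitions receive comparable vectors. One common way to quantify "comparable" is cosine similarity. The sketch below is only an illustration under that assumption: the three-dimensional vectors and the names `term_a`, `term_b`, and `term_c` are invented, and nothing here reproduces the embeddings from [6].

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only.
# term_a and term_b stand in for terms with related definitions;
# term_c stands in for an unrelated term.
toy_embeddings = {
    "term_a": [0.9, 0.1, 0.0],
    "term_b": [0.8, 0.2, 0.1],
    "term_c": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_embeddings["term_a"], toy_embeddings["term_b"])
unrelated = cosine_similarity(toy_embeddings["term_a"], toy_embeddings["term_c"])
print(related > unrelated)  # → True
```

With real definition-based embeddings, the same comparison would use the pre-trained vectors in place of the toy dictionary.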

“…We next evaluate whether our adaptation of Transformer can learn the co-occurrences of labels. Duong et al. [6] noticed in their GO embedding that when one of the child-parent GO labels describes very broad biological events (e.g. low IC), then their vector representations may be far apart.…”

Section: Results (mentioning)

confidence: 99%

“…We suspect that Transformer can model co-occurrences of labels. To validate this, from the t-SNE plot of the vectors in the result section, we observe that a label and its ancestors will be nearby, even when their initial definition embeddings in [6] are dissimilar, as is the case when a term and its distant ancestors can have dissimilar definitions.…”

Section: Methods (mentioning)

confidence: 95%

“…In this paper, we set the vectors representing the amino acids and GO labels to be in the same dimension; that is, E_M and E_G are vectors of the same size. To reduce the number of parameters, for the GO labels, we fix E_G as the definition-based GO embeddings from [6] instead of setting it as a trainable parameter.…”

Section: Methods (mentioning)

confidence: 99%


Annotating Gene Ontology terms for protein sequences with the Transformer model

Duong, Gai, Uppunda, et al. 2020. Preprint. Self Cite


Predicting functions for novel amino acid sequences is a long-standing research problem. The Uniprot database which contains protein sequences annotated with Gene Ontology (GO) terms, is one commonly used training dataset for this problem. Predicting protein functions can then be viewed as a multi-label classification problem where the input is an amino acid sequence and the output is a set of GO terms. Recently, deep convolutional neural network (CNN) models have been introduced to annotate GO terms for protein sequences. However, the CNN architecture can only model close-range interactions between amino acids in a sequence. In this paper, first, we build a novel GO annotation model based on the Transformer neural network. Unlike the CNN architecture, the Transformer models all pairwise interactions for the amino acids within a sequence, and so can capture more relevant information from the sequences. Indeed, we show that our adaptation of Transformer yields higher classification accuracy when compared to the recent CNN-based method DeepGO. Second, we modify our model to take motifs in the protein sequences found by BLAST as additional input features. Our strategy is different from other ensemble approaches that average the outcomes of BLAST-based and machine learning predictors. Third, we integrate into our Transformer the metadata about the protein sequences such as 3D structure and protein-protein interaction (PPI) data. We show that such information can greatly improve the prediction accuracy, especially for rare GO labels.

1 Introduction

Predicting protein functions is an important task in computational biology. With the cost of sequencing continuing to decrease, the gap between the numbers of labeled and unlabeled sequences continues to grow [18]. Protein functions are described by Gene Ontology (GO) terms [16]. Predicting protein functions is a multi-label classification problem where the input is an amino acid sequence and the output is a set of GO terms. GO terms are organized into a hierarchical tree, where generic terms (e.g. cellular anatomical entity) are parents of specific terms (e.g. perforation plate). Due to this tree structure, if a GO term is assigned to a protein, then all its ancestors are also assigned to this same protein.

When analyzing only the amino acid sequence data to predict protein functions, there are two major trends. The first trend relies on string-matching models like Basic Local Alignment Search Tool (BLAST) to match the unknown sequence with labeled proteins in the database [11]. Recently, Zhang et al. [18] combined BLAST with Position-Specific Iterative Basic Local Alignment Search Tool (PSI-BLAST) to retrieve even more labeled proteins which are possibly related to the unknown sequence. The key idea behind BLAST methods is to retrieve proteins that resemble the unknown sequence; most likely, these retrieved proteins will contain similar evolutionarily co...
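
The ancestor rule stated in the introduction excerpt above (a term assigned to a protein implies all of its ancestors) can be sketched as a small graph traversal. This is an illustration only, not the cited paper's code: the function name `propagate_to_ancestors` and the `toy_parents` mapping are hypothetical, and the single parent edge reuses the two example terms from the excerpt as a simplification rather than a verified ontology edge. Each term maps to a list of parents, so the sketch also handles hierarchies where a term has more than one parent.

```python
def propagate_to_ancestors(assigned_terms, parents):
    """Return the assigned terms together with every ancestor term.

    `parents` maps a term to a list of its direct parents; terms that
    are missing from the mapping are treated as having no parents.
    """
    closed = set()
    stack = list(assigned_terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue  # already expanded via another path
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed


# Toy hierarchy (hypothetical edge, for illustration only).
toy_parents = {
    "perforation plate": ["cellular anatomical entity"],
}

labels = propagate_to_ancestors({"perforation plate"}, toy_parents)
print(sorted(labels))  # → ['cellular anatomical entity', 'perforation plate']
```

In a multi-label setting, applying this closure to the label set of each protein keeps the training targets consistent with the hierarchy.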



“…Several such representations have been proposed, one of the first ones being clusDCA [41], which used random walks to learn features that reflect the GO graph topology. More recent approaches make use of advances in Natural Language Processing (NLP) to learn embeddings that reflect semantic meanings based on the term names and/or descriptions [42]. Theoretical work has shown the utility of embedding graph-structured data (as GO terms are) in hyperbolic rather than Euclidean spaces [43].…”

Section: Function Representation (mentioning)

confidence: 99%

Automatic Gene Function Prediction in the 2020’s

Makrodimitris, Ham, Reinders. 2020. Genes


The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning with which AFP in the 2020s can again take a large step forward reinforcing the power of computational biology.


Representation learning applications in biological sequence analysis

Iuchi, Matsutani, Yamada, et al. 2021. Computational and Structural Biotechnology Journal


Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences while the single nucleic acids/amino acids or k-mers in these sequences represent the words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend for the application of representation learning to biological research, in the present study, we have reviewed the existing knowledge in representation learning for biological sequence analysis.
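
The abstract above treats a biological sequence as a sentence whose words are single residues or k-mers. A minimal sketch of that tokenization step follows, assuming overlapping k-mers taken with a sliding window. The function name `kmer_tokens`, the `stride` parameter, and the example sequence are illustrative choices, not taken from the reviewed work.

```python
def kmer_tokens(sequence, k, stride=1):
    """Split a sequence into k-mers using a sliding window.

    stride=1 yields overlapping k-mers; stride=k yields non-overlapping
    ones. A sequence shorter than k produces an empty list.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


# Hypothetical amino acid fragment, invented for illustration.
print(kmer_tokens("MKVLA", 3))             # → ['MKV', 'KVL', 'VLA']
print(kmer_tokens("MKVLAG", 3, stride=3))  # → ['MKV', 'LAG']
```

The resulting tokens play the role of words, so any word-embedding method can then map each k-mer to a vector.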
