2020
DOI: 10.1101/2020.01.31.929604
Preprint

Annotating Gene Ontology terms for protein sequences with the Transformer model

Abstract: Predicting functions for novel amino acid sequences is a long-standing research problem. The UniProt database, which contains protein sequences annotated with Gene Ontology (GO) terms, is one commonly used training dataset for this problem. Predicting protein functions can then be viewed as a multi-label classification problem where the input is an amino acid sequence and the output is a set of GO terms. Recently, deep convolutional neural network (CNN) models have been introduced to annotate GO terms…
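As a concrete illustration of the multi-label framing described in the abstract, the sketch below builds a small Transformer encoder over tokenized amino acid sequences and emits one logit per GO term. This is not the paper's implementation: the vocabulary size, label-space size, and all hyperparameters are placeholder assumptions, and positional encoding is omitted for brevity.

# Minimal sketch of multi-label GO-term prediction with a Transformer
# encoder. All hyperparameters below are illustrative assumptions,
# not values from the paper; positional encoding is omitted.
import torch
import torch.nn as nn

NUM_AMINO_ACIDS = 26   # 20 standard residues + ambiguity codes + padding
NUM_GO_TERMS = 1000    # placeholder label-space size

class GoTransformer(nn.Module):
    def __init__(self, d_model=128, nhead=4, num_layers=2,
                 num_go_terms=NUM_GO_TERMS):
        super().__init__()
        self.embed = nn.Embedding(NUM_AMINO_ACIDS, d_model, padding_idx=0)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_go_terms)

    def forward(self, tokens):                # tokens: (batch, seq_len)
        h = self.encoder(self.embed(tokens))  # (batch, seq_len, d_model)
        pooled = h.mean(dim=1)                # average over positions
        return self.head(pooled)              # one logit per GO term

model = GoTransformer()
tokens = torch.randint(1, NUM_AMINO_ACIDS, (8, 200))  # toy batch
logits = model(tokens)
# Multi-label loss: each GO term is an independent binary decision.
loss = nn.BCEWithLogitsLoss()(logits, torch.rand(8, NUM_GO_TERMS).round())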

Cited by 9 publications (5 citation statements) | References 17 publications
“…Our contributions are as follows. First, TALE replaces previously used convolutional neural networks (CNNs) with self-attention-based transformers (Vaswani et al., 2017), which have made a major breakthrough in natural language processing and recently in protein sequence embedding (Rives et al., 2019; Duong et al., 2020; Elnaggar et al., 2020). Compared to CNNs, transformers can handle global dependencies within the sequence in just one layer, which helps them detect global sequence patterns for function prediction more easily than CNN-based methods do.…”
Section: Introduction (mentioning)
confidence: 99%
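The global-dependency claim quoted above can be made concrete with a toy scaled dot-product attention computation: in a single layer, the L-by-L attention matrix couples every pair of sequence positions directly, whereas a convolution's receptive field grows only with depth. The NumPy sketch below is illustrative only; the shapes and data are made up.

# Toy scaled dot-product attention (Vaswani et al., 2017). In ONE
# attention layer, every sequence position can attend to every other
# position; a convolution only sees a local window per layer.
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (L, L): all pairs of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                     # each output mixes ALL inputs

L, d = 200, 64                             # toy sequence length, model dim
x = np.random.randn(L, d)                  # stand-in residue embeddings
out = attention(x, x, x)                   # self-attention: Q = K = V = x
assert out.shape == (L, d)
# The (L, L) weight matrix couples residue 0 with residue 199 directly;
# a width-3 CNN would need ~100 layers for the same receptive field.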
“…TALE (Cao and Shen, 2021) employed a transformer-based deep-learning model with a joint embedding of sequence inputs and hierarchical function labels. While effectively capturing GO term inter-relationships remains a great challenge, a recent study (Duong et al., 2020) shows that incorporating the hierarchical structure of the GO graph can enable the annotation model to emphasize the GO label distribution, thereby benefiting the final prediction.…”
Section: Introduction (mentioning)
confidence: 99%
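One standard way to exploit the GO hierarchy mentioned above (distinct from the joint-embedding approach TALE itself takes) is the true-path rule: a protein annotated with a GO term implicitly carries all of that term's ancestors, so labels are expanded up the DAG before training or evaluation. The sketch below uses a tiny hypothetical "parents" map, not real GO data.

# Hedged sketch of the GO "true-path rule": expand each protein's GO
# labels with every ancestor term in the DAG. The parents map below is
# a made-up three-term chain, not real GO structure.
parents = {                       # child GO term -> its direct parents
    "GO:0004672": {"GO:0016301"},           # protein kinase activity
    "GO:0016301": {"GO:0003824"},           # kinase activity
    "GO:0003824": set(),                    # catalytic activity (root here)
}

def ancestors(term, dag=parents):
    """All ancestors of `term`, found by walking up the DAG."""
    out = set()
    stack = list(dag.get(term, ()))
    while stack:
        p = stack.pop()
        if p not in out:
            out.add(p)
            stack.extend(dag.get(p, ()))
    return out

def propagate(labels):
    """Expand a protein's GO labels with every ancestor term."""
    full = set(labels)
    for t in labels:
        full |= ancestors(t)
    return full

print(propagate({"GO:0004672"}))
# -> contains all three terms: the annotation plus both ancestors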
“…There is a dire need for efficient and accurate protein function annotation tools for the community to study the growing sequence databases [2][3][4]. Many computational methods have been developed to annotate protein functions based on primary sequences [5][6][7][8][9], protein family and domain annotations [10][11][12], protein-protein interaction (PPI) networks [7,9,10], and other hand-crafted features [8,9,13]. Critical Assessment of Functional Annotation (CAFA), a community-driven benchmark effort for automated protein function annotation, has shown that integrative prediction methods that combine multiple information sources usually outperform sequence-based methods [2][3][4].…”
Section: Introduction (mentioning)
confidence: 99%