2022
DOI: 10.48550/arxiv.2201.11147
Preprint

OntoProtein: Protein Pretraining With Gene Ontology Embedding

Abstract: Self-supervised protein language models have proven their effectiveness in learning protein representations. With increasing computational power, current protein language models pre-trained on millions of diverse sequences can advance the parameter scale from million-level to billion-level and achieve remarkable improvements. However, these prevailing approaches rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better protein representations. …
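
The abstract only outlines the idea of injecting KG facts into a protein language model. As a rough illustration, the minimal sketch below aligns a protein sequence embedding with Gene Ontology (GO) term embeddings through a TransE-style margin loss that could be added to the usual masked-language-modeling loss. This is a hedged sketch under assumptions, not the paper's actual training objective; the encoder, embedding sizes, and the (protein, relation, GO term) triple format are hypothetical.

    # Hypothetical sketch: align protein embeddings with GO term embeddings
    # via a TransE-style margin loss (assumption, not the paper's exact objective).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class KnowledgeAwareLoss(nn.Module):
        def __init__(self, num_go_terms: int, num_relations: int, dim: int = 256):
            super().__init__()
            self.go_emb = nn.Embedding(num_go_terms, dim)    # GO term embeddings
            self.rel_emb = nn.Embedding(num_relations, dim)  # relation embeddings
            self.margin = 1.0

        def score(self, prot_vec, rel_ids, go_ids):
            # TransE-style distance: || protein + relation - GO term ||
            return torch.norm(prot_vec + self.rel_emb(rel_ids) - self.go_emb(go_ids), dim=-1)

        def forward(self, prot_vec, rel_ids, pos_go_ids, neg_go_ids):
            pos = self.score(prot_vec, rel_ids, pos_go_ids)   # true triples
            neg = self.score(prot_vec, rel_ids, neg_go_ids)   # corrupted triples
            # push true triples closer than corrupted ones by a fixed margin
            return F.relu(self.margin + pos - neg).mean()

    # Usage sketch: prot_vec would be the pooled output of a protein language model;
    # a joint objective could then be: total_loss = mlm_loss + ke_loss(prot_vec, rel, pos, neg)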

Cited by 10 publications (14 citation statements) · References 26 publications
“… 32 This illustrates the broad-ranging importance of incorporating domain knowledge into data-driven discovery methods. A similar approach in the domain of proteins is OntoProtein, 36 which uses the Gene Ontology to augment the information available for a protein language model.…”
Section: Discussion
confidence: 99%
“…Following Rao et al (2019) and Rao et al (2021), we compare our model with vanilla protein representation models, including LSTM (Liu, 2017), Transformers (Vaswani et al, 2017) and pre-trained models ESM-1b (Rives et al, 2019), ProtBERT (Elnaggar et al, 2020). We also compare with state-of-the-art knowledge-augmentation models: Potts Model (Balakrishnan et al, 2011), MSA Transformer (Rao et al, 2021) that inject evolutionary knowledge through MSA, OntoProtein (Zhang et al, 2022) that uses gene ontology knowledge graph to augment protein representations and PMLM (He et al, 2021b) that uses pair-wise pretraining to improve co-evolution awareness. We use the reported results of LSTM from Zhang et al (2021); Xu et al (2022).…”
Section: Methods
confidence: 99%
“…Inspired by the great success of protein pre-training [9,57,47,31,54,12], a few works attempt to transfer these techniques to antibody pre-training, given the specificity of antibodies [51,50]. Ruffolo et al [38] first propose an antibody-specific language model, AntiBERTy, to aid understanding of immune repertoires by training on antibody sequence data.…”
Section: Antibody Pre-training
confidence: 99%