2021
DOI: 10.48550/arxiv.2110.15527
Preprint

Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model

Abstract: Understanding protein sequences is vital and urgent for biology, healthcare, and medicine. Labeling approaches are expensive and time-consuming, while the amount of unlabeled data is growing much faster than that of labeled data thanks to low-cost, high-throughput sequencing methods. To extract knowledge from these unlabeled data, representation learning is of significant value for protein-related tasks and has great potential for helping us learn more about protein functions and structures. …
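The truncated abstract stops before the method details, so the sketch below only illustrates the general idea suggested by the title: instead of predicting each masked residue independently, a pairwise masked language model masks two positions at once and scores their joint identity, which is where a co-evolution signal can enter. This is a minimal PyTorch sketch under that assumption; the module name PairwiseMLMHead, the toy encoder, and all sizes are illustrative, not the paper's actual architecture or released code.

```python
import torch
import torch.nn as nn

VOCAB = 25           # 20 amino acids + a few special tokens (assumed vocabulary size)
D_MODEL = 128        # hidden width of the toy encoder
MASK_ID = VOCAB - 1  # assume the last index is the [MASK] token

class PairwiseMLMHead(nn.Module):
    """Scores the joint identity of two masked positions from their embeddings."""
    def __init__(self, d_model: int, vocab: int):
        super().__init__()
        # Project the concatenated pair embedding to a vocab x vocab table of joint logits.
        self.proj = nn.Linear(2 * d_model, vocab * vocab)
        self.vocab = vocab

    def forward(self, h_i: torch.Tensor, h_j: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([h_i, h_j], dim=-1)                     # (batch, 2 * d_model)
        return self.proj(pair).view(-1, self.vocab, self.vocab)  # (batch, vocab, vocab)

# Toy encoder standing in for a transformer; a real model would use
# nn.TransformerEncoder or a pre-trained protein encoder here.
encoder = nn.Sequential(nn.Embedding(VOCAB, D_MODEL), nn.Linear(D_MODEL, D_MODEL))
head = PairwiseMLMHead(D_MODEL, VOCAB)

tokens = torch.randint(0, VOCAB - 1, (4, 64))   # toy batch: (batch, seq_len)
i, j = 10, 37                                   # two positions masked together
target_i, target_j = tokens[:, i].clone(), tokens[:, j].clone()
masked = tokens.clone()
masked[:, i] = masked[:, j] = MASK_ID

h = encoder(masked)                             # (batch, seq_len, d_model)
joint_logits = head(h[:, i], h[:, j])           # (batch, vocab, vocab)

# Cross-entropy over the flattened joint distribution couples the two predictions,
# unlike standard MLM, which would score each masked position independently.
loss = nn.functional.cross_entropy(
    joint_logits.flatten(start_dim=1),          # (batch, vocab * vocab)
    target_i * VOCAB + target_j,                # joint class index for the pair
)
loss.backward()
```

A standard MLM head would apply a vocab-sized softmax at each masked position separately; the vocab × vocab joint table is simply the most direct way to express a dependency between the two residues in this toy setting.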

Cited by 7 publications (9 citation statements) | References 52 publications
“…We hope that this work is the first step in investigating the independent and interaction effects of pretraining and architecture for protein sequence modeling. While we evaluate the effects of masked language model pretraining, transformers have also been used for autoregressive language model pretraining (Madani et al, 2020) and pairwise masked language modeling (He et al, 2021), and combining structural information (Mansoor et al, 2021; Zhang et al, 2022; McPartlon et al, 2022; Hsu et al, 2022; Chen et al, 2022; Wang et al, 2022) or functional annotations (Brandes et al, 2021) offers further directions for protein pretraining tasks.…”
Section: Discussion (mentioning)
confidence: 99%
“…Following Rao et al (2019) and Rao et al (2021), we compare our model with vanilla protein representation models, including LSTM (Liu, 2017), Transformers (Vaswani et al, 2017), and pre-trained models ESM-1b (Rives et al, 2019) and ProtBERT (Elnaggar et al, 2020). We also compare with state-of-the-art knowledge-augmentation models: the Potts Model (Balakrishnan et al, 2011) and MSA Transformer (Rao et al, 2021), which inject evolutionary knowledge through MSA; OntoProtein (Zhang et al, 2022), which uses the gene ontology knowledge graph to augment protein representations; and PMLM (He et al, 2021b), which uses pair-wise pretraining to improve co-evolution awareness. We use the reported results of LSTM from Zhang et al (2021); Xu et al (2022).…”
Section: Methods (mentioning)
confidence: 99%
“…Large scale pre-training enables language models to learn structural and evolutionary knowledge (Elnaggar et al, 2021; Jumper et al, 2021; Lin et al, 2022). Despite these successes, many important applications still require MSAs and other external knowledge (Rao et al, 2021; Jumper et al, 2021; He et al, 2021b; Zhang et al, 2021; Ju et al, 2021; Rao et al, 2020). MSAs have been shown effective in improving representation learning, despite being extremely slow and costly in computation.…”
Section: Related Work (mentioning)
confidence: 99%
“…There is growing interest in developing protein language models (pLMs) at the scale of evolution due to the abundance of 1D amino acid sequences, such as the series of ESM (Rives et al, 2019; Lin et al, 2022), TAPE (Rao et al, 2019), ProtTrans (Elnaggar et al, 2021), PRoBERTa (Nambiar et al, 2020), PMLM (He et al, 2021), ProteinLM (Xiao et al, 2021), PLUS (Min et al, 2021), Adversarial MLM (McDermott et al, 2021), ProteinBERT (Brandes et al, 2022), and CARP (Yang et al, 2022a) in masked language modeling (MLM) fashion, ProtGPT2 (Ferruz et al, 2022) in causal language modeling fashion, and several others (Melnyk et al, 2022a; Madani et al, 2021; Unsal et al, 2022; Nourani et al, 2021; Lu et al, 2020; Sturmfels et al, 2020; Strodthoff et al, 2020). These protein language models are able to generalize across a wide range of downstream applications and can capture evolutionary information about secondary and tertiary structures from sequences alone.…”
Section: Related Work (mentioning)
confidence: 99%