2022
DOI: 10.1101/2022.03.09.483666
Preprint

A deep unsupervised language model for protein design

Abstract: Protein design aims to build new proteins from scratch, thereby holding the potential to tackle many environmental and biomedical problems. Recent progress in the field of natural language processing (NLP) has enabled the implementation of ever-growing language models capable of understanding and generating text with human-like capabilities. Given the many similarities between human languages and protein sequences, the use of NLP models offers itself for predictive tasks in protein research. Motivated by the e…


Cited by 23 publications (21 citation statements)
References 43 publications
“…We next cover additional related work on generative models of protein sequence and structure, beyond the discussion in Section 1.1. Following the success of deep language models, Ferruz et al [11] developed protein sequence models to generate new proteins, but these models do not allow specification of structural motifs. Another class of methods, referred to as fixed backbone sequence design [12,20,40,27,18], attempts to solve the problem of identifying a sequence that folds into any given designable backbone structure.…”
Section: Appendix Table Of Contents (mentioning)
confidence: 99%
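
As context for the generation claim quoted above, below is a minimal sketch of sampling new sequences from an autoregressive protein language model. It assumes a ProtGPT2-style checkpoint is publicly available on the Hugging Face Hub under the ID nferruz/ProtGPT2; the prompt and sampling parameters are illustrative assumptions, not settings taken from the preprint.

```python
from transformers import pipeline

# Assumption: a ProtGPT2-style autoregressive protein LM hosted on the
# Hugging Face Hub. Sampling settings below are illustrative, not the
# authors' published protocol.
generator = pipeline("text-generation", model="nferruz/ProtGPT2")

samples = generator(
    "<|endoftext|>",          # start-of-sequence prompt (assumed)
    max_length=100,            # token budget per generated sequence
    do_sample=True,            # stochastic decoding rather than greedy search
    top_k=950,                 # broad top-k to preserve amino-acid diversity
    repetition_penalty=1.2,    # discourage low-complexity repeats
    num_return_sequences=5,
)
for s in samples:
    # Strip line breaks so each sample prints as one amino-acid string.
    print(s["generated_text"].replace("\n", ""))
```

Sampling with a wide top-k and a repetition penalty is a common way to keep generated sequences diverse while avoiding degenerate low-complexity stretches; note that, as the quoted statement points out, such sequence-only sampling offers no way to specify structural motifs.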
“…Large pretrained protein language models, largely relying on the attention-based transformer [Vaswani et al, 2017] architecture, have advanced the ability of machine-learning methods to predict protein structure and function from sequence, especially when labeled training data is sparse. Most modern self-supervised protein sequence pretraining combines a transformer model with either an autoregressive likelihood [Madani et al, 2020, 2021, Ferruz et al, 2022, Hesslow et al, 2022] or with the masked language modeling (MLM) task introduced for natural language by BERT (bidirectional encoder representations from transformers) [Devlin et al, 2018]. Pretrained transformer protein MLMs contain structural information [Rao et al, 2019, Rives et al, 2021, Chowdhury et al, 2021], encode evolutionary trajectories [Hie et al, 2022a, 2021], are zero-shot predictors of mutation fitness effects [Meier et al, 2021], improve out-of-domain generalization on protein engineering datasets [Dallago et al, 2021], and suggest improved sequences for engineering [Hie et al, 2022b].…”
Section: Introduction (mentioning)
confidence: 99%
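
The masked language modeling (MLM) objective described in the quoted statement can be illustrated with a small, self-contained sketch. The toy amino-acid vocabulary, tiny transformer encoder, and 15% masking rate below are assumptions for illustration (positional encodings are omitted for brevity); none of this reproduces the cited models.

```python
import torch
import torch.nn as nn

# Toy vocabulary: 20 canonical amino acids plus a [MASK] token (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MASK_ID = len(vocab)
VOCAB_SIZE = len(vocab) + 1

def tokenize(seq: str) -> torch.Tensor:
    return torch.tensor([vocab[aa] for aa in seq])

class TinyProteinMLM(nn.Module):
    """Deliberately small BERT-style encoder; no positional encodings for brevity."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(self.embed(tokens)))

model = TinyProteinMLM()
tokens = tokenize("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ").unsqueeze(0)  # shape (1, L)

# Mask ~15% of positions, as in BERT-style pretraining; compute loss only there.
mask = torch.rand(tokens.shape) < 0.15
mask[0, 0] = True  # guarantee at least one masked position for this demo
inputs = tokens.masked_fill(mask, MASK_ID)
labels = tokens.masked_fill(~mask, -100)  # -100 is ignored by cross_entropy

logits = model(inputs)
loss = nn.functional.cross_entropy(
    logits.view(-1, VOCAB_SIZE), labels.view(-1), ignore_index=-100
)
loss.backward()
print(f"MLM loss on masked residues: {loss.item():.3f}")
```

The key design point the quoted statement relies on is that the encoder sees the full (bidirectional) context around each masked residue, in contrast to the left-to-right factorization used by the autoregressive models cited alongside it.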
“…Sequence-only protein language models: A large body of work has recently studied the application of language models to sequence-only protein generation [Madani et al, 2020, Shin et al, 2021, Ferruz et al, 2022, Hesslow et al, 2022] and representation learning. See Wu et al [2021] for a more comprehensive review of de novo protein sequence design with deep generative models.…”
Section: Related Work (mentioning)
confidence: 99%