2020
DOI: 10.1101/2020.03.07.982272
Preprint

ProGen: Language Modeling for Protein Generation

Abstract: Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science. We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly structural annotations. We train a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags such as molecular function and cellular component. This provides ProGen with an…
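The conditioning described in the abstract can be pictured as prepending control tokens (taxonomy, keyword tags) to the token stream before autoregressive generation. The following is a minimal sketch of that idea, not ProGen's actual code: `toy_next_token_logits` is a hypothetical stand-in for the trained 1.2B-parameter model, and the tag names are illustrative.

```python
# Sketch of ProGen-style conditional generation: control tags are
# prepended to the context, then amino-acid tokens are generated
# autoregressively. The scoring function is a deterministic toy
# stand-in for a real trained language model (an assumption, not
# ProGen's architecture).

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues

def toy_next_token_logits(context):
    """Stand-in for a trained LM: deterministic scores derived from
    the current context length, one score per amino acid."""
    return [(len(context) * 31 + i * 7) % 20 for i in range(len(AMINO_ACIDS))]

def generate(tags, length):
    """Greedy autoregressive generation conditioned on prepended tags."""
    context = [f"<{t}>" for t in tags]  # control tokens, e.g. <hydrolase>
    protein = []
    for _ in range(length):
        logits = toy_next_token_logits(context)
        idx = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
        protein.append(AMINO_ACIDS[idx])
        context.append(AMINO_ACIDS[idx])  # feed the choice back in
    return "".join(protein)

seq = generate(["Homo sapiens", "hydrolase"], 12)
```

In the real model the tags and residues share one vocabulary and the next-token distribution comes from a Transformer; swapping the tags changes the conditioning context and hence the sampled sequence.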

Cited by 179 publications (184 citation statements)
References 41 publications
“…1. Overall, BFD was about eight times larger than the largest data sets used previously [19]. Despite the 8-fold increase in data, the number of tokens increased only fivefold ( Fig.…”
Section: Data for Language Models (LMs); citation type: mentioning; confidence: 75%
“…Supervised models of protein function are currently limited by the availability and quality of experimental data but will become increasingly accurate and general as researchers continue to experimentally characterize protein sequence space. Other important machine learning advances relevant to protein engineering include generative modeling to sample non-natural protein sequences [23, 29, 30], language models to learn protein representations from diverse natural sequences [31–34], and strategies to incorporate machine learning predictions into directed evolution experiments [35, 36]. Coupled with optimization methods that allow supervised models to efficiently explore new parts of the sequence space [37–40], these approaches could make possible a new generation of data-driven protein engineering.…”
Section: Discussion; citation type: mentioning; confidence: 99%
“…TAPE created a benchmark of five tasks ranging from remote homology to fluorescence prediction to assess protein representation learning models. In [48,63], autoregressive generative models were trained to predict the functional effect of mutations and generate natural-like proteins. All aforementioned studies have been the subject of rapid development, with Transformer architectures seemingly providing the most promising avenue for future research.…”
Section: Protein Language Models; citation type: mentioning; confidence: 99%