2023
DOI: 10.1101/2023.01.17.524472

A deep generative model of the SARS-CoV-2 spike protein predicts future variants

Abstract: SARS-CoV-2 has demonstrated a robust ability to adapt in response to environmental pressures, increasing viral transmission and evading immune surveillance by mutating its molecular machinery. While viral sequencing has allowed for the early detection of emerging variants, methods to predict mutations before they occur remain limited. This work presents SpikeGPT2, a deep generative model based on ProtGPT2 and fine-tuned on SARS-CoV-2 spike (S) protein sequences deposited in the NIH Data Hub before May 2021. S…
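The preprint's own training code is not reproduced here; below is a minimal sketch of the fine-tuning setup the abstract describes, assuming the public nferruz/ProtGPT2 checkpoint on Hugging Face. The data file spike_train.txt and all hyperparameters are illustrative placeholders, not the authors' actual values.

```python
# Minimal sketch: fine-tune ProtGPT2 on spike sequences as a causal-LM task.
# Checkpoint is the public nferruz/ProtGPT2; file path and hyperparameters
# are illustrative, not the values used in the preprint.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 vocabulary has no pad token
model = AutoModelForCausalLM.from_pretrained("nferruz/ProtGPT2")

# One spike protein sequence per line (hypothetical export, e.g. from the NIH Data Hub).
dataset = load_dataset("text", data_files={"train": "spike_train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="spikegpt2",
        per_device_train_batch_size=2,
        num_train_epochs=3,
        learning_rate=1e-5,
    ),
    train_dataset=tokenized["train"],
    # mlm=False selects standard left-to-right (causal) language modeling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```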

Cited by 4 publications (7 citation statements)
References 41 publications
“…ProtGPT2 uses the same tokenization algorithm as GPT2. ProtGPT2 has been applied in limited capacity to the Spike protein modeling problem [Dho23], for the purposes of generating individual mutations in the Receptor Binding Domain (RBD) sub-segment of the Spike protein. In this article, we present the first application of ProtGPT2 as a de novo generator of complete Spike protein sequences.…”
Section: Results
confidence: 99%
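As the statement above notes, ProtGPT2 inherits GPT-2's byte-pair-encoding tokenizer, so a protein string is split into multi-residue subword tokens rather than single residues. A short sketch of tokenization and de novo sampling, again assuming the public nferruz/ProtGPT2 checkpoint; the sampling parameters follow that model card, and the prompt fragment is the start of the reference spike sequence.

```python
# ProtGPT2 reuses GPT-2's byte-pair-encoding (BPE) tokenizer, so a
# protein string is split into multi-residue subword tokens.
from transformers import AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
print(tokenizer.tokenize("MFVFLVLLPLVSSQCVNLT"))  # multi-residue BPE tokens

# De novo generation: sample whole sequences from the start-of-document token.
generator = pipeline("text-generation", model="nferruz/ProtGPT2")
outputs = generator(
    "<|endoftext|>",
    max_length=300,            # illustrative; the full spike is ~1273 residues
    do_sample=True,
    top_k=950,                 # sampling settings from the ProtGPT2 model card
    repetition_penalty=1.2,
    num_return_sequences=3,
    eos_token_id=0,            # id of <|endoftext|> in the GPT-2 vocabulary
)
for out in outputs:
    print(out["generated_text"][:60])
```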
“…However, the problem of advance generation of the complete Spike protein sequence has not been studied before. While PLMs have been used to label Spike mutations with properties [HZBB21] [MBW+22] or to generate individual mutations, or sub-sequences within certain regions of the Spike protein [Dho23] [CNW+22], they have not been applied to the problem of advance generation of complete Spike protein sequences. Other machine learning methods have been developed to analyze the Spike protein, where they are used to characterize individual Spike mutations or effects of mutations on Spike sub-sequences [OJB+22] [TWG+22] [WLW+23] [HLZ+22].…”
Section: Introduction
confidence: 99%
“…In the case of protein sequence generation, quality is usually determined in silico [8, 13] or in vitro [9] by measuring a small set of properties of the generated sequences in aggregate, such as sequence similarity, stability, and fold prediction. Prior generative work related to the Spike protein only evaluated individual mutations [16], or generated sub-segments of the Spike protein carrying point mutations using methods that are not scalable to the full sequence generation task [17] (where possible, we have included prior work in our experiments). Thus, the generated output is either a response to a prefix in the case of language (partial generation), not entirely correct or incorrect based on small discrete set membership (language, prior bioinformatics works), or only examines sub-sequences or individual mutations of the virus (partial generation).…”
Section: Introduction
confidence: 99%
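The statement above lists sequence similarity among the aggregate in-silico quality measures; a minimal sketch of one such check, using Biopython's pairwise aligner to compute percent identity against a reference. Both sequences are short illustrative fragments, not actual model outputs.

```python
# One in-silico check of the kind mentioned above: percent identity of a
# generated sequence against a reference spike fragment (global alignment).
from Bio import Align

reference = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHS"
generated = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHT"

aligner = Align.PairwiseAligner()
aligner.mode = "global"
alignment = aligner.align(reference, generated)[0]

# Count identical columns in the aligned pair of strings.
aligned_ref, aligned_gen = alignment[0], alignment[1]
matches = sum(a == b for a, b in zip(aligned_ref, aligned_gen))
identity = 100.0 * matches / len(aligned_ref)
print(f"percent identity: {identity:.1f}%")
```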
“…Training large language models from the ground up requires large datasets and vast computational resources, but pretrained models can be fine-tuned on specific protein families or properties. Dhodapkar (2023) introduced SpikeGPT2, a fine-tuned ProtGPT2 model that generates SARS-CoV-2 spike protein sequences to predict potential future mutations. Madani et al. (2023) have also fine-tuned ProGen on protein families unseen during training.…”
Section: Introduction
confidence: 99%