2022
DOI: 10.1101/2022.11.17.516989
Preprint

Running ahead of evolution - AI based simulation for predicting future high-risk SARS-CoV-2 variants

Abstract: The never-ending emergence of SARS-CoV-2 variations of concern (VOCs) has challenged the whole world for pandemic control. In order to develop effective drugs and vaccines, one needs to efficiently simulate SARS-CoV-2 spike receptor binding domain (RBD) mutations and identify high-risk variants. We pretrain a large protein language model on approximately 408 million protein sequences and construct a high-throughput screening for the prediction of binding affinity and antibody escape. As the first work o…

Cited by 3 publications (12 citation statements)
References 82 publications
“…A related work, ProtFound [CNW+22], uses masked language modeling to generate mutated subsequences within the RBD sub-segment of the Spike protein. To do this, positions in the Spike sub-sequence are masked and the model is asked to fill in the masked locations.…”
Section: Results (mentioning)
confidence: 99%
“…However, the problem of advance generation of the complete Spike protein sequence has not been studied before. While PLMs have been used to label Spike mutations with properties [HZBB21] [MBW+22] or to generate individual mutations, or sub-sequences within certain regions of the Spike protein [Dho23] [CNW+22], they have not been applied to the problem of advance generation of complete Spike protein sequences. Other machine learning methods have been developed to analyze the Spike protein, where they are used to characterize individual Spike mutations or effects of mutations on Spike sub-sequences [OJB+22] [TWG+22] [WLW+23] [HLZ+22].…”
Section: Introduction (mentioning)
confidence: 99%
“…In our previous work [16], we proposed a pipeline for SARS-CoV-2 mutation simulation that approximates a lineage through high-throughput variant generation and screening. In this work, we go one step further and develop an evolution-inspired framework, ProtFound-V (Protein Foundation Model for Virus), for viral property prediction (Fig. 1).…”
Section: Main (mentioning)
confidence: 99%
“…In the case of protein sequence generation, quality is usually determined in silico [8, 13] or in vitro [9] by measuring a small set of properties of the generated sequences in aggregate, such as sequence similarity, stability, and fold prediction. Prior generative work related to the Spike protein only evaluated individual mutations [16], or generated sub-segments of the Spike protein carrying point mutations using methods that are not scalable to the full sequence generation task [17] (where possible, we have included prior work in our experiments). Thus, the generated output is either a response to a prefix in the case of language (partial generation), not entirely correct or incorrect based on small discrete set membership (language, prior bioinformatics works), or only examines sub-sequences or individual mutations of the virus (partial generation).…”
Section: Introduction (mentioning)
confidence: 99%