Running ahead of evolution - AI based simulation for predicting future high-risk SARS-CoV-2 variants

Chen, Jie; Nie, Zhiwei; Yu, Wang; Wang, Kai; Xu, Fan; Hu, Yaqin; Zheng, Bin; Wang, Zhennan; Song, Guoli; Zhang, Jingyi; Fu, Jie; Huang, Xiansong; Wang, Zhongqi; Ren, Zhixiang; Wang, Qiankun; Li, Daixi; Wei, Dong‐Qing; Zhou, Bin; Yang, Chao

doi:10.1101/2022.11.17.516989

Cited by 3 publications

(12 citation statements)

References 82 publications


“…A related work, ProtFound [CNW+22], uses masked language modeling to generate mutated subsequences within the RBD sub-segment of the Spike protein. To do this, positions in the Spike sub-sequence are masked and the model is asked to fill in the masked locations.…”

Section: Results (mentioning)

confidence: 99%
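
The mask-and-fill procedure described in the statement above can be illustrated with a minimal sketch. It is not the ProtFound implementation: the `<mask>` token, the helper names, and the toy sequence are assumptions made for illustration, and a random stand-in takes the place of a trained masked language model.

```python
import random
from typing import Callable, List, Sequence

MASK = "<mask>"  # hypothetical mask token; real models define their own
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid letters


def mask_positions(sequence: str, positions: Sequence[int]) -> List[str]:
    """Return the sequence as a token list with the chosen positions masked."""
    tokens = list(sequence)
    for pos in positions:
        tokens[pos] = MASK
    return tokens


def fill_masks(tokens: List[str],
               predict: Callable[[List[str], int], str]) -> str:
    """Replace every masked token with the residue proposed by `predict`."""
    filled = list(tokens)
    for i, tok in enumerate(tokens):
        if tok == MASK:
            # The predictor sees the masked context and the position to fill.
            filled[i] = predict(tokens, i)
    return "".join(filled)


def random_predictor(seed: int = 0) -> Callable[[List[str], int], str]:
    """Stand-in for a trained model: picks a residue uniformly at random."""
    rng = random.Random(seed)

    def predict(tokens: List[str], index: int) -> str:
        return rng.choice(AMINO_ACIDS)

    return predict


fragment = "ACDEFGHIKLMN"  # arbitrary toy string, not a real Spike fragment
masked = mask_positions(fragment, [2, 7])
variant = fill_masks(masked, random_predictor(seed=0))
```

In a real setting, `predict` would query a protein language model for the most likely (or a sampled) residue at each masked position given the unmasked context.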

“…However, the problem of advance generation of the complete Spike protein sequence has not been studied before. While PLMs have been used to label Spike mutations with properties [HZBB21] [MBW+22] or to generate individual mutations, or sub-sequences within certain regions of the Spike protein [Dho23] [CNW+22], they have not been applied to the problem of advance generation of complete Spike protein sequences. Other machine learning methods have been developed to analyze the Spike protein, where they are used to characterize individual Spike mutations or effects of mutations on Spike sub-sequences [OJB+22] [TWG+22] [WLW+23] [HLZ+22].…”

Section: Introduction (mentioning)

confidence: 99%

“…In the case of protein sequence generation, quality is usually determined in silico [FSH22] [NRW+22] or in vitro [SRK+21] by measuring a small set of properties of the generated sequences in aggregate, such as sequence similarity, stability, and fold prediction. Prior generative work related to the Spike protein only evaluated individual mutations [Dho23], or generated sub-segments of the Spike protein carrying point mutations using methods that are not scalable to the full sequence generation task [CNW+22] (where possible, we have included prior work in our experiments). Thus, the generated output is either a response to a prefix in the case of language (partial generation), not entirely correct or incorrect based on small discrete set membership (language, prior bioinformatics works), or only examines sub-sequences or individual mutations of the virus (partial generation).…”

Section: Introduction (mentioning)

confidence: 99%

PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning

Ramachandran, Lumetta, Chen. 2023. Preprint.

Deep generative models have achieved state-of-the-art performance in many areas including image generation, code generation and natural language generation. We explore the use of deep generative models in producing complete instances of as-yet undiscovered SARS-CoV2 Spike protein sequences. The Spike protein is the primary initiator of infection by the SARS-CoV2 virus, and hence, the ability to predict future manifestations of the Spike protein is invaluable, enabling critical tasks such as advance validation of pharmaceutical interventions. We examine specific requirements of generating sequences for a pandemic and formulate a novel framework for training models for these requirements. Our solution only uses sequence information submitted in SARS-CoV2 repositories without the need for additional laboratory experiments. Resulting models substantially outperform a state-of-the-art generative model for protein sequences finetuned on SARS-CoV2 data. Samples produced from our models are four times as likely to be novel and real SARS-CoV2, and ten times as infectious, cumulatively. We find that among higher ranked sequences generated from our model, over 70% are discovered in the future, over twice the rate of the baseline. Our models represent a promising source of hypothetical SARS-CoV2 sequences, thus providing a key tool for advance preparation against the pandemic. PandoGen is available at https://github.com/UIUC-ChenLab/PandoGen

“…In our previous work [16], we proposed a pipeline for SARS-CoV-2 mutation simulation that approximates a lineage through high-throughput variant generation and screening. In this work, we go one step further and develop an evolution-inspired framework, ProtFound-V (Protein Foundation Model for Virus), for viral property prediction (Fig. 1).…”

Section: Main (mentioning)

confidence: 99%

E2VD: a unified evolution-driven framework for virus variation drivers prediction

Nie, Liu, Chen et al. 2023. Preprint.

Self Cite

Emerging viral infections, especially the global pandemic COVID-19, have had catastrophic impacts on public health worldwide. The culprit of this pandemic, SARS-CoV-2, continues to evolve, giving rise to numerous sublineages with distinct characteristics. The traditional post-hoc wet-lab approach is lagging behind, and it cannot quickly predict the evolutionary trends of the virus while consuming high costs. Capturing the evolutionary drivers of virus and predicting potential high-risk mutations has become an urgent and critical problem to address. To tackle this challenge, we introduce ProtFound-V, an evolution-inspired deep learning framework designed to explore the mutational trajectory of virus. Take SARS-CoV-2 as an example, ProtFound-V accurately identifies the evolutionary advantage of Omicron and proposes evolutionary trends consistent with wet-lab experiments through in silico deep mutational scanning. This showcases the potential of deep learning predictions to replace traditional wet-lab experimental measurements. With the evolution-guided large language model, ProtFound-V presents a new state-of-the-art performance in key property predictions. Despite the challenge posed by epistasis to model generalization, ProtFound-V remains robust when extrapolating to lineages with different genetic backgrounds. Overall, this work paves the way for rapid responses to emerging viral infections, allowing for a plug-and-play approach to understanding and predicting virus evolution.

“…In the case of protein sequence generation, quality is usually determined in silico [8, 13] or in vitro [9] by measuring a small set of properties of the generated sequences in aggregate, such as sequence similarity, stability, and fold prediction. Prior generative work related to the Spike protein only evaluated individual mutations [16], or generated sub-segments of the Spike protein carrying point mutations using methods that are not scalable to the full sequence generation task [17] (where possible, we have included prior work in our experiments). Thus, the generated output is either a response to a prefix in the case of language (partial generation), not entirely correct or incorrect based on small discrete set membership (language, prior bioinformatics works), or only examines sub-sequences or individual mutations of the virus (partial generation).…”

Section: Introduction (mentioning)

confidence: 99%

PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning

Ramachandran, Lumetta, Chen. 2024. PLoS Comput Biol.

One of the challenges in a viral pandemic is the emergence of novel variants with different phenotypical characteristics. An ability to forecast future viral individuals at the sequence level enables advance preparation by characterizing the sequences and closing vulnerabilities in current preventative and therapeutic methods. In this article, we explore, in the context of a viral pandemic, the problem of generating complete instances of undiscovered viral protein sequences, which have a high likelihood of being discovered in the future using protein language models. Current approaches to training these models fit model parameters to a known sequence set, which does not suit pandemic forecasting as future sequences differ from known sequences in some respects. To address this, we develop a novel method, called PandoGen, to train protein language models towards the pandemic protein forecasting task. PandoGen combines techniques such as synthetic data generation, conditional sequence generation, and reward-based learning, enabling the model to forecast future sequences, with a high propensity to spread. Applying our method to modeling the SARS-CoV-2 Spike protein sequence, we find empirically that our model forecasts twice as many novel sequences with five times the case counts compared to a model that is 30× larger. Our method forecasts unseen lineages months in advance, whereas models 4× and 30× larger forecast almost no new lineages. When trained on data available up to a month before the onset of important Variants of Concern, our method consistently forecasts sequences belonging to those variants within tight sequence budgets.
