2022
DOI: 10.1101/2022.10.10.511571
Preprint
GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

Abstract: Our work seeks to transform how new and emergent variants of pandemic-causing viruses, especially SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and then fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants o…
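The abstract describes treating whole genomes as input to a language model. As a minimal sketch of what "adapting LLMs for genomic data" can mean in practice, the snippet below tokenizes a nucleotide sequence into codon-level integer IDs, the kind of vocabulary a genome-scale model could be trained over. The vocabulary layout, token names, and the choice of codon-level (rather than single-nucleotide or k-mer) tokenization are assumptions for illustration, not details confirmed by the abstract.

```python
# Hedged sketch: map a genome to a sequence of codon token IDs, a plausible
# front end for a genome-scale language model. The <unk> token and ID layout
# are hypothetical choices made for this example.

from itertools import product

# All 64 codons over the DNA alphabet, plus an unknown token for ambiguous
# bases (e.g. 'N'); IDs are assigned arbitrarily here.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
VOCAB = {"<unk>": 0, **{codon: i + 1 for i, codon in enumerate(CODONS)}}

def tokenize_genome(seq: str) -> list[int]:
    """Split a nucleotide sequence into non-overlapping codons and
    return their integer token IDs; a trailing partial codon is dropped."""
    seq = seq.upper()
    ids = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        ids.append(VOCAB.get(seq[i:i + 3], VOCAB["<unk>"]))
    return ids

print(tokenize_genome("ATGGTTNNNTAA"))
```

A real model would feed these IDs into an embedding layer and train with a next-token or masked-token objective over millions of genomes, as the abstract's pre-training/fine-tuning description suggests.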

Cited by 26 publications (21 citation statements)
References 65 publications
“…There have already been early efforts to develop foundation models for biological sequences 33,34 , including RFdiffusion, which generates proteins on the basis of simple specifications (for example, a binding target) 35 . Building on this work, GMAI-based solutions can incorporate both language and protein sequence data during training to offer a versatile text interface.…”
Section: Bedside Decision Support: GMAI Enables a New Class of Bedside...
confidence: 99%
“…This deluge of data poses both an opportunity and a challenge; on the one hand, abundant genomic data able to uncover the intricate patterns of natural variability across species and populations is vastly available; on the other hand, powerful deep learning methods, that can operate at large scale, are required to perform accurate signal extraction from this large amount of unlabelled genomic data. Large foundation models trained on sequences of nucleotides appear to be the natural choice to seize the opportunity in genomics research, with attempts to explore different angles recently proposed [18, 19, 19, 20].…”
Section: Introduction
confidence: 99%
“…To alleviate the problem of insufficient data, the self-supervised learning (SSL) strategy utilized in large language models (LLMs) like BERT (14) can be adopted. The SSL strategy has been utilized to learn representations of protein (15,16), non-coding RNA (17) and prokaryote genome (18) sequences. These models were pre-trained on millions of sequences from a large number of species and therefore captured the evolutionary information that is critical for modelling.…”
Section: Introduction
confidence: 99%