DNA language models are powerful predictors of genome-wide variant effects

Preprint, 2022. DOI: 10.1101/2022.08.22.504706
Abstract: Variant effect prediction has traditionally focused on training supervised models on labeled data. However, motivated by recent advances in natural language processing, where pre-training on large unlabeled corpora has yielded substantial gains on diverse tasks, unsupervised pre-training on massive databases of protein sequences has proven to be an effective approach to extracting complex information about proteins. Such models have been shown to learn variant effects in coding regions in a zero-shot man…
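As a rough illustration of the zero-shot idea sketched in the abstract, a masked DNA language model can score a variant by the log-likelihood ratio of the alternate versus the reference allele at the masked variant position. The sketch below is a minimal Python rendering of that idea under stated assumptions; the masked_probs function is a hypothetical stand-in for a trained model (it returns a uniform placeholder so the example runs end to end), not the authors' actual API.

    import numpy as np

    NUCLEOTIDES = ["A", "C", "G", "T"]

    def masked_probs(sequence: str, pos: int) -> np.ndarray:
        # Hypothetical stand-in for a trained DNA language model: the
        # probability distribution over A/C/G/T at `pos` when that position
        # is masked. A uniform placeholder keeps the sketch runnable.
        return np.full(4, 0.25)

    def variant_effect_score(sequence: str, pos: int, ref: str, alt: str) -> float:
        # Zero-shot score: log P(alt) - log P(ref) at the masked position.
        # More negative scores suggest the variant is less compatible with
        # the learned sequence distribution, i.e. more likely deleterious.
        probs = masked_probs(sequence, pos)
        return float(np.log(probs[NUCLEOTIDES.index(alt)])
                     - np.log(probs[NUCLEOTIDES.index(ref)]))

    # Usage: score a C>T substitution at position 5 of a toy 11-bp window.
    print(variant_effect_score("ACGTACGTACG", pos=5, ref="C", alt="T"))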

Citations: cited by 33 publications (62 citation statements)
References: 77 publications
“…Previous work has established MLM, specifically masked nucleotide modeling, as a method to pre-train models for genomics (Ji et al., 2021; Benegas et al., 2022). It was shown that these models capture some semantics of the regulatory code.…”
Section: Introduction (mentioning; confidence: 99%)
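To make the masked nucleotide modeling objective cited above concrete, here is a minimal, hedged PyTorch sketch: random positions in a DNA sequence are replaced with a mask token, and the model is trained with cross-entropy to recover the original nucleotides at exactly those positions. The tiny Transformer encoder is an illustrative placeholder, not any of the cited architectures.

    import torch
    import torch.nn as nn

    VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

    class TinyDNAMaskedLM(nn.Module):
        # Illustrative masked nucleotide model: embed tokens, contextualize
        # with one Transformer encoder layer, predict a nucleotide per position.
        def __init__(self, d_model: int = 32):
            super().__init__()
            self.embed = nn.Embedding(len(VOCAB), d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=1)
            self.head = nn.Linear(d_model, 4)  # logits over A/C/G/T

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            return self.head(self.encoder(self.embed(tokens)))

    def mlm_loss(model: nn.Module, seqs: torch.Tensor, mask_rate: float = 0.15):
        # Mask ~15% of positions, then score the model's reconstruction of
        # the original nucleotides at exactly those positions.
        mask = torch.rand_like(seqs, dtype=torch.float) < mask_rate
        corrupted = seqs.masked_fill(mask, VOCAB["[MASK]"])
        logits = model(corrupted)
        return nn.functional.cross_entropy(
            logits[mask], seqs[mask]  # loss only on masked positions
        )

    # Usage on a random toy batch of 8 sequences of length 64.
    model = TinyDNAMaskedLM()
    batch = torch.randint(0, 4, (8, 64))
    loss = mlm_loss(model, batch)
    loss.backward()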
“…Besides, we may employ knowledge-distillation techniques to transfer the knowledge in LLMs to lightweight architectures such as CNNs to handle long sequences. Moreover, given that recent studies have shown that some improved versions of CNNs can achieve performance superior to Transformer-based models (49, 50), it is also promising to develop pre-trained genomic sequence language models without Transformers (51).…”
Section: Discussion (mentioning; confidence: 99%)
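The distillation idea raised in this statement can be sketched as a standard teacher-student objective; the snippet below is a minimal, assumption-laden example rather than anything from the cited works. It matches a student's temperature-softened output distribution to a frozen teacher's via KL divergence.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
        # KL divergence between temperature-softened teacher and student
        # distributions; gradients flow only through the student.
        t = temperature
        soft_teacher = F.softmax(teacher_logits / t, dim=-1)
        log_student = F.log_softmax(student_logits / t, dim=-1)
        # Scale by t^2 to keep gradient magnitudes comparable across temperatures.
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t

    # Usage: a toy batch of per-position nucleotide logits (batch=8, classes=4).
    teacher_logits = torch.randn(8, 4)  # frozen large model's outputs
    student_logits = torch.randn(8, 4, requires_grad=True)
    loss = distillation_loss(student_logits, teacher_logits.detach())
    loss.backward()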
“…As we have shown here, researchers do not need to start from scratch and can repurpose existing networks for evolutionary applications. Unsurprisingly, such repurposing has led to several creative purposes well outside of the developer's initial intention, such as applying cycleGANs [72] to normalize single cell RNA-seq data across datasets [73] and using natural language processing models to predict the effect of nucleotide variants [74]. At a time when unprecedented resources are being used to develop deep-learning technologies outside of biology, biologists from all fields should consider the myriad ways in which these technologies can be leveraged to solve problems and answer longstanding questions.…”
Section: Discussion (mentioning; confidence: 99%)
“…2017) to normalize single cell RNA-seq data across datasets (Khan et al. 2022) and using natural language processing models to predict the effect of nucleotide variants (Benegas et al. 2022).…”
Section: Discussion (mentioning; confidence: 99%)