2021
DOI: 10.1093/bioinformatics/btab083

DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Abstract: Motivation Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT…

Cited by 521 publications (536 citation statements)
References 53 publications
“…1). Transformers are a class of deep learning models that have achieved substantial breakthroughs in natural language processing (NLP) 6,7 and were also recently applied to model short DNA sequences 8 . They consist of attention layers that transform each position in the input sequence by computing a weighted sum across the representations of all other positions in the sequence.…”
Section: Results (mentioning)
confidence: 99%
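The weighted-sum formulation described in the excerpt above can be made concrete with a minimal sketch. The code below is an illustrative single-head self-attention in NumPy; it is not code from DNABERT or the citing paper, and the names (self_attention, Wq, Wk, Wv) are hypothetical.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence.

    X          : (seq_len, d_model) input representations
    Wq, Wk, Wv : (d_model, d_k) projection matrices

    Each output position is a weighted sum over the value vectors of all
    positions, with weights given by softmax-normalised query-key dot products.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of values

# Toy usage: a 4-token sequence with 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```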
“…Supervised deep learning methods for the prediction of TF occupancy data and chromatin accessibility are numerous, ranging from early deep convolutional neural network-based pipelines such as DeepSEA 10 and Basset 9 to more recent approaches that usually mirror advances in deep learning methods for natural language processing, such as the LSTM-based DanQ 23 , Basenji using dilated CNNs 11 , DeepSite 24 , and DNA-BERT 25 . While these models have produced highly accurate predictions of TF occupancy, the interpretation of the models requires detailed feature attribution (e.g.…”
Section: Discussion (mentioning)
confidence: 99%
“…DNABERT, in contrast, is currently the only model to pre-train BERT-based models using a whole human reference genome [68]. During preprocessing, the genome, with gaps and unannotated regions excluded, was split into non-overlapping sequences of 5 to 510 consecutive nucleotides, which were subsequently converted to 3- to 6-mer representations.…”
Section: Survey Of Representation Learning Applications In Sequence A… (mentioning)
confidence: 99%
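To make the k-mer conversion step concrete, here is a minimal sketch assuming DNABERT-style tokenisation, in which each sequence chunk is turned into overlapping k-mers with a stride of one; the function name kmerize is hypothetical, and the exact preprocessing details should be checked against the DNABERT paper.

```python
def kmerize(sequence: str, k: int = 6) -> list[str]:
    """Convert a DNA sequence into overlapping k-mer tokens
    (sliding window of size k, stride 1), for k in {3, 4, 5, 6}."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Toy usage on a short fragment.
print(kmerize("ATGCGTAC", k=3))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```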