2023
DOI: 10.1101/2023.07.11.548628
Preprint

DNAGPT: A Generalized Pre-trained Tool for Multiple DNA Sequence Analysis Tasks

Abstract: The success of the GPT series proves that GPT can extract general information from sequences, thereby benefiting all downstream tasks. This motivates us to use pre-trained models to explore the hidden information in DNA sequences. However, the data and task requirements in DNA sequence analysis are complex and diverse, as DNA-relevant data includes different types of information, such as sequences, expression levels, etc., while there is currently no model specifically designed for these characteristics. Hereby…

Cited by 11 publications (8 citation statements) | References 54 publications
“…Recently, there has been a surge of pre-trained gLMs [20–43]. gLMs take as input DNA sequences that have undergone tokenization, an encoding scheme applied to either a single nucleotide or a k-mer of nucleotides.…”
Section: Introduction (mentioning, confidence: 99%)
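The tokenization this statement refers to is easy to make concrete. Below is a minimal Python sketch, assuming non-overlapping k-mers; both functions and their names are illustrative, not taken from any of the cited gLMs.

```python
# Minimal sketch of the tokenization schemes described above: single-
# nucleotide tokens vs. non-overlapping fixed-size k-mers. Function
# names and choices are illustrative, not taken from any cited gLM.

def tokenize_single_nucleotide(seq: str) -> list[str]:
    """Each base (A, C, G, T) becomes its own token."""
    return list(seq.upper())

def tokenize_kmer(seq: str, k: int = 6) -> list[str]:
    """Split into non-overlapping k-mers; a shorter tail k-mer is kept.
    Note: some gLMs use overlapping (sliding-window) k-mers instead."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq), k)]

print(tokenize_single_nucleotide("ACGTAC"))  # ['A', 'C', 'G', 'T', 'A', 'C']
print(tokenize_kmer("ACGTACGTACGT", k=6))    # ['ACGTAC', 'GTACGT']
```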
“…Current gLMs are composed of different choices for the tokenization, base architecture, language modeling objective, and pre-training data. Tokenization of DNA sequences is applied to either a single nucleotide [20–22], a k-mer of fixed size [23–25], or a k-mer of variable size via byte-pair tokenization [26, 27, 45], which aims to aggregate DNA in a manner that reduces the k-mer bias in the genome, a problem known as rare token imbalance. The base architecture is typically a stack of transformer layers [46], with vanilla multi-head self-attention [23–25, 27–31] or an exotic attention variant (e.g., flash attention [26, 47], sparse attention [32, 33], or axial attention [34, 48]).…”
Section: Introduction (mentioning, confidence: 99%)
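As a rough illustration of the "stack of transformer layers" with vanilla multi-head self-attention described here, the following PyTorch sketch wires token and position embeddings into a small encoder with a language-modeling head; every name and hyperparameter is an illustrative assumption, not the architecture of any cited model.

```python
# Rough PyTorch sketch of the typical gLM base architecture described
# above: token embeddings plus a stack of vanilla transformer encoder
# layers with multi-head self-attention. All names and hyperparameters
# are illustrative assumptions, not any specific published gLM.

import torch
import torch.nn as nn

class TinyGenomicLM(nn.Module):
    def __init__(self, vocab_size=4096, d_model=256, n_heads=8,
                 n_layers=4, max_len=512):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)  # learned positions
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)  # per-token logits

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.tok_embed(token_ids) + self.pos_embed(positions)
        return self.lm_head(self.encoder(h))  # (batch, seq_len, vocab_size)

ids = torch.randint(0, 4096, (2, 128))  # a batch of tokenized DNA
print(TinyGenomicLM()(ids).shape)       # torch.Size([2, 128, 4096])
```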
“…However, their structural similarity to human language (long strings consisting of basic units such as bases or words) provides opportunities for modeling and interpreting DNA sequences using NLP methods [8, 9]. Scientists are increasingly leveraging pre-trained genomic models, leading to significant successes with Transformer-based frameworks [10, 11] and other language framework models [9]. For sequence classification evaluation, researchers have constructed benchmark datasets for DNA classification and used modified CNN models as the baselines.…”
Section: Introduction (mentioning, confidence: 99%)
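For context on the "modified CNN models as the baselines" mentioned in this statement, here is a hedged sketch of a minimal 1D-CNN classifier over one-hot encoded DNA; the class name and layer sizes are assumptions for illustration only.

```python
# Hedged sketch of a simple 1D-CNN baseline for DNA sequence
# classification, in the spirit of the CNN baselines mentioned above.
# Input is one-hot encoded over A/C/G/T; layer sizes are illustrative.

import torch
import torch.nn as nn

class CNNBaseline(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=8),  # 4 input channels: A/C/G/T
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),          # global max pooling
        )
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, 4, seq_len)
        return self.fc(self.conv(x).squeeze(-1))

x = torch.rand(2, 4, 200)      # stand-in for one-hot encoded sequences
print(CNNBaseline()(x).shape)  # torch.Size([2, 2])
```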