2021
DOI: 10.1038/s41598-020-80670-x

A deep learning framework combined with word embedding to identify DNA replication origins

Abstract: The DNA replication influences the inheritance of genetic information in the DNA life cycle. As the distribution of replication origins (ORIs) is the major determinant to precisely regulate the replication process, the correct identification of ORIs is significant in giving an insightful understanding of DNA replication mechanisms and the regulatory mechanisms of genetic expressions. For eukaryotes in particular, multiple ORIs exist in each of their gene sequences to complete the replication in a reasonable pe…

Citations: cited by 13 publications (10 citation statements)
References: 56 publications
“…In this study, we considered triplets as words given their role in protein coding genes as codons as well as their possible roles in primordial RNA synthesis as discussed in the introduction. In addition, the use of triplets as words during language-based modeling of genome sequences by previous studies has demonstrated that they capture important biological information that can be used for different prediction tasks such as the inference of DNA replication origins [93], identification of enhancers [94] or transcription factor binding sites [95]. Therefore, to apply word2vec to the metagenomes, each species level metagenome sequence was split into non-overlapping triplets, starting from the 5'-end.…”
Section: Training Of the Language (Embedding) Models (mentioning)
confidence: 99%
“…• First each genome was split into non-overlapping triplets "TAT ACG GGG AAA ...". In previous biological sequence analysis tasks [93,94], overlapping K-mers were used during word embedding. However, in this study the use of overlapping K-mers would introduce a statistical artefact in examining the relationships between adjacent triplets since overlapping triplets obtained by a sliding through the sequence 1-nucleotide a time would be predictably different by 1 nucleotide.…”
Section: Training Of the Language (Embedding) Models (mentioning)
confidence: 99%
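
The difference between the two tokenization schemes described in the statement above can be illustrated with a minimal Python sketch. The function name `split_kmers`, its parameters, and the choice to drop an incomplete trailing k-mer are assumptions made for illustration; they are not taken from the cited studies.

```python
def split_kmers(seq: str, k: int = 3, overlapping: bool = False) -> list[str]:
    """Split a sequence into k-mers, reading from the 5'-end.

    Hypothetical helper: with overlapping=False the window advances by k
    (non-overlapping triplets for k=3); with overlapping=True it slides
    one nucleotide at a time. An incomplete trailing k-mer is dropped,
    which is an assumption of this sketch.
    """
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


seq = "TATACGGGGAAA"

# Non-overlapping triplets: each nucleotide appears in exactly one word.
non_overlapping = split_kmers(seq)

# Overlapping triplets: each word shares k-1 nucleotides with the next one.
overlapping = split_kmers(seq, overlapping=True)

print(" ".join(non_overlapping))
print(" ".join(overlapping))
```

With a one-nucleotide slide, every adjacent pair of words shares two of its three nucleotides by construction, which is the predictable relationship the statement describes as a statistical artefact.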
“…Typically, in language models, word embeddings represent the latent space (dimensions) of a corpus of text (Mikolov et al, 2013) and can capture highly nonlinear and contextual relationships. Codons (tri-nucleotides, 3-mers) translations represent a natural basis for word representations and have been utilized in the past for learning embedding models for modeling various outcomes such as mutation susceptibility (Yilmaz, 2020) and gene sequence correlations (Wu et al, 2021). Recently, Hie et.…”
Section: Introduction (mentioning)
confidence: 99%
“…However, there is a paucity of studies that explore the use of unsupervised embeddings for machine learning based prediction of surges in infections. In these models, codons (tri-nucleotides, 3-mers) translations represent a natural basis for word representations and have been utilized in the past for learning embedding models for modelling various outcomes such as mutation susceptibility and gene sequence correlations (Yilmaz, 2020) (Wu et al, 2021). Recently, Hie et.…”
Section: Introduction (mentioning)
confidence: 99%