2017
DOI: 10.1038/s41598-017-14017-4

Machine learning model for sequence-driven DNA G-quadruplex formation

Abstract: We describe a sequence-based computational model to predict DNA G-quadruplex (G4) formation. The model was developed using large-scale machine learning from an extensive experimental G4-formation dataset, recently obtained for the human genome via G4-seq methodology. Our model differentiates many widely accepted putative quadruplex sequences that do not actually form stable genomic G4 structures, correctly assessing the G4 folding potential of over 700,000 such sequences in the human genome. Moreover, our appr…

Citations: cited by 132 publications (141 citation statements)
References: 37 publications

“…The motifs were found to be enriched in regulatory regions, especially promoters, first introns, and telomeres. Subsequent studies have led to the broadening of the G4‐motif definition and to the ongoing refinement of G4‐mining algorithms. However, major G4‐prone genomic fragments, including telomeres and oncogene promoters, are well established and the bioinformatics predictions are generally consistent with those from the G4‐sequencing data.…”
Section: Introduction
mentioning
confidence: 94%
“…Thus, whole‐genome sequencing experiments and bioinformatics studies predict the formation of numerous G4 structures in the genome. However, in spite of some attempts, the stability and folding topology of G4‐DNA and G4‐RNA are difficult to predict purely on the basis of sequence information. A growing body of structural and biophysical data demonstrates that these structures frequently escape the consensus motif (i.e., $G_n\text{-}N_i\text{-}G_n\text{-}N_j\text{-}G_n\text{-}N_k\text{-}G_n$, in which $n = 2$ to 4; $i, j, k = 1$ to 7; and N = any base), as evidenced by snap‐back, G‐vacant, and bulged G4 structures which are difficult to predict by bioinformatics algorithms.…”
Section: Introduction
mentioning
confidence: 99%
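
The consensus motif quoted in the statement above can be written as a regular expression. The following is a minimal sketch of that motif search only, not of the machine learning model described in the abstract; the function names `consensus_g4_pattern` and `find_consensus_g4` are hypothetical, and the sketch assumes an uppercase-convertible DNA string, the same tract length n for all four G-tracts, and a search on the given strand only.

```python
import re


def consensus_g4_pattern(n):
    """Build a regex for G_n-N_i-G_n-N_j-G_n-N_k-G_n with loops of 1 to 7 bases.

    Assumption: all four G-tracts share the same length n, as in the quoted motif.
    """
    tract = "G{%d}" % n
    loop = "[ACGT]{1,7}"  # N = any base, loop length 1 to 7
    return re.compile(tract + loop + tract + loop + tract + loop + tract)


def find_consensus_g4(seq, n_values=(2, 3, 4)):
    """Return (start, end, n, matched_sequence) for each non-overlapping hit.

    Coordinates are 0-based and half-open. Each tract length n is searched
    separately, so the same region may be reported for more than one n.
    """
    seq = seq.upper()
    hits = []
    for n in n_values:
        for match in consensus_g4_pattern(n).finditer(seq):
            hits.append((match.start(), match.end(), n, match.group()))
    return hits


if __name__ == "__main__":
    # Human telomeric repeat flanked by one base on each side.
    telomeric = "AGGGTTAGGGTTAGGGTTAGGGA"
    for hit in find_consensus_g4(telomeric, n_values=(3,)):
        print(hit)
```

A pattern of this kind reports only sequences that fit the consensus, which is why the bulged, G‐vacant, and snap‐back structures mentioned in the statement fall outside it.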
“…Conversely, we expected G4 motifs to be more abundant in long read assemblies, since these have been suggested to be virtually free from sequence-based biases (Eid et al 2009). To test this, we predicted the presence of G4 motifs using Quadron (Sahakyan et al 2017) in all the different assemblies. All the de-novo Illumina-based assemblies had fewer predicted G4 sites than the PacBio assemblies (Figure 3c and Supplementary Table S8).…”
mentioning
confidence: 99%
“…The final repeat library also contains the manually curated version of the consensus sequences previously generated on other two birds-of-paradise Astrapia rothschildi "astRot", Ptiloris overlapping hits with a score greater than 19 were used for subsequent analysis as suggested in (Sahakyan et al 2017). The density of such motifs per chromosome model was calculated using bedtools coverage (BEDTools 2.27.1; Quinlan (2014)).…”
mentioning
confidence: 99%
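
The filtering and density steps mentioned in the statement above can be illustrated with a small sketch. It is a plain-Python stand-in, not the citing authors' pipeline and not a call to bedtools: it keeps hits scoring above 19, merges overlapping intervals, and reports the covered fraction of each chromosome. The function name `g4_density`, the tuple layout `(chrom, start, end, score)`, and the 0-based half-open coordinates are all assumptions made for illustration.

```python
from collections import defaultdict

SCORE_CUTOFF = 19  # threshold quoted in the citation statement


def g4_density(hits, chrom_lengths, cutoff=SCORE_CUTOFF):
    """Fraction of each chromosome covered by hits scoring above `cutoff`.

    hits: iterable of (chrom, start, end, score); layout is an assumption.
    chrom_lengths: dict mapping chromosome name to its length in bases.
    """
    # Keep only hits strictly above the cutoff, grouped by chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end, score in hits:
        if score > cutoff:
            by_chrom[chrom].append((start, end))

    density = {}
    for chrom, length in chrom_lengths.items():
        covered = 0
        cur_start = cur_end = None
        # Sweep sorted intervals, merging overlaps so bases are counted once.
        for start, end in sorted(by_chrom.get(chrom, [])):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    covered += cur_end - cur_start
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        if cur_end is not None:
            covered += cur_end - cur_start
        density[chrom] = covered / length
    return density


if __name__ == "__main__":
    example_hits = [
        ("chr1", 0, 20, 25.0),
        ("chr1", 10, 30, 21.0),  # overlaps the previous hit
        ("chr1", 50, 60, 5.0),   # below the cutoff, ignored
    ]
    print(g4_density(example_hits, {"chr1": 100}))
```

Merging before counting avoids double-counting bases shared by overlapping hits; whether the cited study counted bases or motifs per chromosome is not stated in the excerpt.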