2016
DOI: 10.1038/srep31900

DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

Abstract: The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three gen…

Citations: cited by 292 publications (255 citation statements)
References: 33 publications
“…The first one utilized Illumina and PacBio sequencing reads together (hybrid assembly) and the second one – PacBio data only (PacBio‐only assembly). The software used to accomplish the hybrid assembly was DBG2OLC package (Ye et al., 2016). This assembly included Illumina PE raw reads and raw PacBio reads.…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…Using DNA from the leaves of GDDH13, we generated ~120-fold coverage of Illumina paired-end reads (72 Gb), 80-fold coverage of Illumina Nextera mate-pair reads (58 Gb) at three different insert sizes (2, 5 and 10 kb) and ~35-fold coverage of PacBio sequencing data (24 Gb; 2,837,045 subreads with a mean length of 8,474 bp). The Illumina paired-end reads were first assembled using SOAPdenovo 25, and the resulting contigs were combined with the PacBio reads using the DBG2OLC assembler 26. This resulted in an assembly that consisted of 2,150 contigs with an N50 of 620 kb (i.e., 50% of the assembly was contained in contigs ≥620 kb in size) (Supplementary Table 1) and a total length of 625.2 Mb, which were subsequently corrected by using the Illumina paired-end reads (94,896 single-base assembly errors corrected; 1,054,709 insertions (1,466,015 bp) and 123,510 deletions (178,733 bp)) and scaffolded by using Illumina mate-pair reads with BESST (assembly N50 increased from 620 kb to 699 kb).…”
Section: Genome Sequencing Assembly and Scaffolding
Citation type: mentioning
confidence: 99%
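The statement above reports contiguity as an N50 and defines it in passing (50% of the assembly is contained in contigs at least that long). As a minimal sketch of that definition, assuming contig lengths are already available as a plain list of integers, the metric can be computed as follows. The function name `n50` and the toy lengths are illustrative only and are not taken from the cited assembly or from DBG2OLC.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half of the
        # total assembly size defines the N50.
        if running >= half_total:
            return length


# Hypothetical toy lengths in kb (not data from the cited study).
toy_contigs_kb = [900, 700, 620, 400, 300, 200]
toy_n50_kb = n50(toy_contigs_kb)
```

With the toy list, the total is 3,120 kb, so the running sum first reaches the 1,560 kb halfway point at the 700 kb contig. Real assemblies would typically read contig lengths from a FASTA index rather than a hand-written list.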
“…25). Next, the PacBio reads and Illumina contigs were combined to perform a hybrid assembly using the DBG2OLC pipeline 26.…”
Section: Author Contributions
Citation type: mentioning
confidence: 99%
“…In addition, we used Canu to precorrect the original reads and assembled the resulting data using SMARTdenovo (hereafter CanuSMARTdenovo) as described in the Supplemental Methods. Assemblies of the genome with the hybrid assembler dbg2olc (Ye et al, 2016) and an early version of the wtdbg assembler had subpar N50 values and were thus not analyzed further (Supplemental Data Sets 1A and 1B). However, a newer version of Canu significantly lowered the consumed CPU hours from ~80k to 14.36k CPU hours, closing the speed gap to the other assemblers.…”
Section: Genome Assembly Strategies and Metrics
Citation type: mentioning
confidence: 99%