Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Koren, Sergey; Walenz, Brian P.; Berlin, Konstantin; Miller, Jason R.; Bergman, Nicholas H.; Phillippy, Adam M.

doi:10.1101/gr.215087.116

Cited by 6,105 publications

(4,447 citation statements)

References 78 publications

Supporting

Mentioning

4,185

Contrasting

Unclassified

“…The data were assembled using Canu (Koren et al, 2017) and SMARTdenovo, which represent state-of-the-art assemblers known to support Oxford nanopore sequencing technology (Istace et al, 2017). Furthermore, data were assembled with miniasm (Li, 2016), which is a fast assembler without a consensus step, thus necessitating a postassembly polishing and/or consensus step.…”

Section: Genome Assembly Strategies and Metrics (mentioning)

confidence: 99%

De Novo Assembly of a New Solanum pennellii Accession Using Nanopore Sequencing

et al. 2017

Self Cite

Updates in nanopore technology have made it possible to obtain gigabases of sequence data. Prior to this, nanopore sequencing technology was mainly used to analyze microbial samples. Here, we describe the generation of a comprehensive nanopore sequencing data set with a median read length of 11,979 bp for a self-compatible accession of the wild tomato species Solanum pennellii. We describe the assembly of its genome to a contig N50 of 2.5 MB. The assembly pipeline comprised initial read correction with Canu and assembly with SMARTdenovo. The resulting raw nanopore-based de novo genome is structurally highly similar to that of the reference S. pennellii LA716 accession but has a high error rate and was rich in homopolymer deletions. After polishing the assembly with Illumina reads, we obtained an error rate of <0.02% when assessed versus the same Illumina data. We obtained a gene completeness of 96.53%, slightly surpassing that of the reference S. pennellii. Taken together, our data indicate that such long read sequencing data can be used to affordably sequence and assemble gigabase-sized plant genomes.
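
The contig N50 quoted in this abstract is a standard contiguity metric: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative and are not taken from the cited paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running sum first reaches half of the
    total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths): total is 54, half is 27,
# and 10 + 9 + 8 = 27 reaches it at the contig of length 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```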

“…Yet, we obtained complete and single-contig assemblies for both the PacBio and ONT E. coli long read datasets, using KD-tree together with the Canu assembler (v1.5; Koren et al, 2017). This indicates that exhaustive overlap detection is not necessary for genome assembly, especially if the overlapping is precise and sequencing depth high.…”

Section: Sensitivity and Precision (mentioning)

confidence: 99%

“…They are reported in the MHAP output format, a tabular format compatible with e.g. the Canu assembler (Koren et al, 2017).…”

Section: Filtering Of Read Overlaps (mentioning)

confidence: 99%

Fast and memory-efficient noisy read overlapping with KD-trees

Parkhomchuk, Bremges, McHardy 2017 (preprint)


Motivation: Third-generation sequencing technologies produce long, but noisy reads with increasing sequencing throughput and decreasing per-base costs. Detecting read-to-read overlaps in such data is the most computationally intensive step in de novo assembly. Recently, efficient algorithms were developed for this task; nearly all of these utilize long k-mers (>10 bp) to compare reads, but vary in their approaches to indexing, hashing, filtering, and dimensionality reduction. Results: We describe an algorithm for efficient overlap detection that directly compares the full spectrum of short k-mers, namely tetramers, through geometric embedding and approximate nearest neighbor search in multidimensional KD-trees. A proof of concept implementation detected read-to-read overlaps in bacterial PacBio and ONT datasets with notably lower memory consumption than state-of-the-art approaches and allowed downstream de novo assembly into single contigs. We also introduce a sequence-context dependent tagging scheme that contributes to memory and computational efficiency and could be used with other aligning and overlapping algorithms.
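
The idea of comparing reads through their tetramer spectra can be illustrated with a small sketch: each sequence window is mapped to a 256-dimensional vector of tetramer frequencies, and candidate overlaps are windows whose vectors lie close together. The abstract does not give the preprint's actual embedding, windowing, or KD-tree parameters, so everything below is an assumption: the names `tetramer_profile`, `euclidean`, and `nearest` are hypothetical, and a brute-force scan stands in for the approximate nearest-neighbor search in a KD-tree.

```python
from itertools import product
from math import sqrt

# All 256 tetramers over the DNA alphabet, each mapped to a vector index.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetramer_profile(seq):
    """Embed a sequence as a 256-dimensional tetramer frequency vector."""
    counts = [0.0] * len(TETRAMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # skips tetramers containing N etc.
        if idx is not None:
            counts[idx] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, profiles):
    """Index of the profile closest to query (brute force, not a KD-tree)."""
    return min(range(len(profiles)), key=lambda i: euclidean(query, profiles[i]))


# Hypothetical windows: the query shares its tetramer content with window 0.
windows = ["ACGTACGTACGT", "TTTTTTTTTTTT"]
profiles = [tetramer_profile(w) for w in windows]
print(nearest(tetramer_profile("ACGTACGTAC"), profiles))
```

A real implementation would replace `nearest` with a spatial index so that each query avoids a full scan; the brute-force version is kept here only because it makes the distance comparison explicit.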

“…The dataset is available from (R. Wick 2017a) and detailed in Supplementary Table 3. For comparison Canu (Koren et al 2017) (version 1.5) assemblies and Unicycler (Wick et al 2016) (version 0.4.0) assemblies, utilising Miniasm (Li 2016) and Racon (Vaser et al 2017), post Nanopolish (https://github.com/jts/nanopolish) (version 0.7.0), created using only the long read data (R. Wick 2017b) were used. There are currently no large publicly available ONT datasets.…”

Section: Ont Samples (mentioning)

confidence: 99%

Peer Review #2 of "Rapid multi-locus sequence typing direct from uncorrected long reads using Krocus (v0.1)"

2018

Genome sequencing is rapidly being adopted in reference labs and hospitals for bacterial outbreak investigation and diagnostics where time is critical. Seven gene multi-locus sequence typing is a standard tool for broadly classifying samples into sequence types, allowing, in many cases, to rule a sample out of an outbreak, or allowing for general characteristics about a bacterial strain to be inferred. Long read sequencing technologies, such as from Oxford Nanopore, can produce read data within minutes of an experiment starting, unlike short read sequencing technologies which require many hours/days. However, the error rates of raw uncorrected long read data are very high. We present Krocus which can predict a sequence type directly from uncorrected long reads, and which was designed to consume read data as it is produced, providing results in minutes. It is the only tool which can do this from uncorrected long reads. We tested Krocus on over 700 isolates sequenced using long read sequencing technologies from PacBio and Oxford Nanopore. It provides sequence types for isolates on average within 90 seconds, with a sensitivity of 94% and specificity of 97% on real sample data, directly from uncorrected raw sequence reads. The software is written in Python and is available under the open source license GNU GPL version 3.
