Assembly of long, error-prone reads using repeat graphs

Kolmogorov, Mikhail; Yuan, Jeffrey; Lin, Yu; Pevzner, Pavel A.

doi:10.1038/s41587-019-0072-8

Cited by 3,682 publications

(3,016 citation statements)

References 42 publications

Mentioning: 3,006


“…A total of 47 Gbp (~99x coverage) of sequencing data was generated on the Pacific Biosystems Sequel sequencing machine. Flye version 2.3.3 (Kolmogorov et al 2019) was run on the PacBio sequencing data specifying a genome size of 700 Mbp (which is between 0.5x and 2.0x of expected genome size of related marine fishes; http://www.genomesize.com/) and otherwise default options. We ran BUSCO version 3.0.1 to assess genome assembly completeness (Simão et al 2015).…”

Section: Genome Sequencing and Draft Assembly

mentioning

confidence: 99%

Standing genetic variation and chromosomal rearrangements facilitate local adaptation in a marine fish

Cayuela

Rougemont

Laporte

et al. 2019

Preprint


Population genetic theory states that adaptation most frequently occurs from standing genetic variation, which results from the interplay between different evolutionary processes including mutation, chromosomal rearrangements, drift, gene flow and selection. To date, empirical work focusing on the contribution of standing genetic variation to local adaptation in the presence of high gene flow has been limited to a restricted number of study systems. Marine organisms are excellent biological models to address this issue since many species have to cope with variable environmental conditions acting as selective agents despite high dispersal abilities. In this study, we examined how demographic history, standing genetic variation linked to chromosomal rearrangements, and shared polymorphism among glacial lineages contribute to local adaptation to environmental conditions in a marine fish, the capelin (Mallotus villosus). We used a comprehensive dataset of genome-wide single nucleotide polymorphisms (25,904 filtered SNPs) genotyped in 1,359 individuals collected from 31 spawning sites in the northwest Atlantic (North America and Greenland waters). First, we reconstructed the history of divergence among three glacial lineages and showed that they diverged from 3.8 to 1.8 MyA. Depending on the pair of lineages considered, historical demographic modelling provided evidence for divergence with gene flow and secondary contacts, shaped by barriers to gene flow and linked selection. We next identified candidate loci associated with reproductive isolation of these lineages. Given the absence of physical or geographic barriers, we thus propose that these lineages may represent three cryptic species of capelin. Within each of these, our analyses provided evidence for large Ne and high gene flow at both historical and contemporary time scales among spawning sites. Furthermore, we detected a polymorphic chromosomal rearrangement leading to the coexistence of three haplogroups within the Northwest Atlantic lineage, but absent in the other two clades. Genotype-environment associations revealed molecular signatures of local adaptation to environmental conditions prevailing at spawning sites. Altogether, our study shows that standing genetic variation associated with both chromosomal rearrangements and ancestral polymorphism contributes to local adaptation in the presence of high gene flow.
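
The draft-assembly statement quoted for this preprint gives three figures: 47 Gbp of PacBio data, roughly 99x coverage, and a Flye genome-size setting of 700 Mbp. The sketch below shows how such figures relate through plain arithmetic (depth equals total bases divided by genome size). It is not the authors' calculation, and the function and constant names are illustrative assumptions.

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def implied_genome_size(total_bases: float, depth: float) -> float:
    """Genome size that would yield the given depth from the given data volume."""
    return total_bases / depth


# Figures taken from the quoted citation statement.
TOTAL_BASES = 47e9        # 47 Gbp of PacBio data
REPORTED_DEPTH = 99       # "~99x coverage"
FLYE_GENOME_SIZE = 700e6  # genome size passed to Flye

# Depth if the genome really were 700 Mbp.
print(f"{coverage(TOTAL_BASES, FLYE_GENOME_SIZE):.1f}x")                   # 67.1x
# Genome size implied by the reported ~99x depth.
print(f"{implied_genome_size(TOTAL_BASES, REPORTED_DEPTH) / 1e6:.0f} Mbp")  # 475 Mbp
```

Under these numbers the reported depth corresponds to an expected genome of roughly 475 Mbp, so the 700 Mbp setting is about 1.5 times that value, inside the 0.5x to 2.0x window the quote mentions.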

“…canariae NCTC 14382 T was previously sequenced by an Illumina HiSeq 2500 at Public Health England using the Nextera XP library preparation kit following a retrospective study on yersiniosis isolates cultured from patients between April 2004 and March 2018 (8 For ONT MinION data, the run metrics were inspected using NanoPlot (version 1.0) (14) before raw FAST5 files were base-called using Guppy (version 3.2.2) with the high accuracy model to FASTQ files. Adapters were trimmed from the raw reads by Porechop (version 0.2.4) using default parameters for SQK-RAD004 before the genome was de novo assembled with Flye (version 2.5) (15,16). The best assembly parameters were empirically determined to include the option flags "meta" and "plasmid" with coverage reduced to 30X for initial contig assembly based on a predicted genome size of ~4.73 Mbp as informed by de novo assembly of short read Illumina data (17).…”

Section: Genome Features

mentioning

confidence: 99%

Yersinia canariae sp. nov., isolated from a human yersiniosis case

Nguyen

Greig

Hurley

et al. 2019

Preprint


A Gram-negative rod from the Yersinia genus was isolated from a clinical case of yersiniosis in the United Kingdom. Long-read sequencing data from an Oxford Nanopore Technologies (ONT) MinION, in conjunction with Illumina HiSeq reads, were used to generate a finished-quality genome of this strain. The Overall Genome Related Index (OGRI) of the strain was used to determine that it was a novel species within Yersinia, despite biochemical similarities to Yersinia enterocolitica. The 16S ribosomal RNA gene accessions are MN434982-MN434987 and the accession number for the complete and closed chromosome is CP043727. The type strain is CFS3336 T (= NCTC 14382 T / = LMG accession under process).
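
The assembly statement quoted for this preprint describes a Flye 2.5 run with the "meta" and "plasmid" options, coverage reduced to 30X for initial contig assembly, and a predicted genome size of ~4.73 Mbp. The sketch below shows what such an invocation might look like. The flag spellings (`--nano-raw`, `--genome-size`, `--asm-coverage`, `--meta`, `--plasmids`, `--out-dir`) are an assumption based on Flye's command-line help and should be checked against `flye --help` for the installed version. The file and directory names are hypothetical. The command is only assembled and printed, not executed.

```python
import shlex


def bases_for_depth(depth: float, genome_size: float) -> float:
    """Number of read bases needed to reach a given depth over a genome."""
    return depth * genome_size


def flye_command(reads: str, out_dir: str, genome_size: str, asm_coverage: int) -> list:
    """Build an argument list for a Flye run like the one the quote describes.

    Flag names are assumed from Flye's CLI help, not taken from the preprint.
    """
    return [
        "flye",
        "--nano-raw", reads,                  # uncorrected ONT reads
        "--genome-size", genome_size,         # e.g. "4.73m"
        "--asm-coverage", str(asm_coverage),  # depth used for the initial contigs
        "--meta",                             # uneven-coverage mode
        "--plasmids",                         # attempt to recover short plasmids
        "--out-dir", out_dir,
    ]


# Hypothetical input and output names.
cmd = flye_command("trimmed_reads.fastq", "flye_out", "4.73m", 30)
print(shlex.join(cmd))
print(f"{bases_for_depth(30, 4.73e6) / 1e6:.1f} Mbp of reads at 30X")  # 141.9 Mbp
```

At 30X over a 4.73 Mbp genome, roughly 142 Mbp of reads would feed the initial assembly stage.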


“…minimizers [18], homopolymers compressed k-mers [14], minhash [17] etc.). The reduced long-read representation is appropriate for detecting overlaps >2kb in a fast way [14,16,17]. The newest long-read assemblers are therefore starting to be good also at goal 3 [14,16,17].…”

mentioning

confidence: 99%

“…The reduced long-read representation is appropriate for detecting overlaps >2kb in a fast way [14,16,17]. The newest long-read assemblers are therefore starting to be good also at goal 3 [14,16,17]. However, assembling uncorrected long-reads has the undesirable effect of giving more work to the consensus polisher [15,17,19-21].…”

mentioning

confidence: 99%

“…Usually, long-read assemblers perform a single round of long-read polishing [14,16,17], that is followed by several rounds of polishing with long [15,17,19,21] and short [15,20,22] reads using third-party tools [15,17,19-22]. Currently, polishing large genomes, such as the human genome, can take much more computational time than the long-read assembly itself [14,16,17]. Moreover, there is no standard practice for polishing large genomes, and usually several rounds of polishing are employed with user-defined criteria in order to remove consensus errors, notably the short-indels occurring in homopolymer regions, which are the characteristic error signature of current long-read technologies [14,16,17].…”

mentioning

confidence: 99%


WENGAN: Efficient and high quality hybrid de novo assembly of human genomes

Buena-Atienza

Ossowski

et al. 2019

Preprint


The continuous improvement of long-read sequencing technologies along with the development of ad hoc algorithms has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate genome assemblies of large, repeat-rich human genomes. To date, most of the human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (… Mb), few assembly errors (contig NGA50: 10.9-45.91 Mb), good consensus quality (QV: 27.79-33.61), and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours: 153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50: 45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb). Providing highest quality at low computational cost, WENGAN is an important step towards the democratization of the de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan

1 Introduction

Genome assembly is the process by which an unknown genome sequence is constructed by detecting overlaps between a set of redundant genomic reads. Most genome assemblers represent the overlap information using different kinds of assembly graphs [1,2]. The main idea behind these algorithms is to reduce the genome assembly problem to a path problem where the genome is reconstructed by finding "the" true genome path in a tangled assembly graph [1,2]. The tangledness comes from the complexity that repetitive genomic regions induce in the assembly graphs [1,2]. The first graph-based genome assemblers used overlaps of variable length to construct an overlap graph [2]. In such a graph, the reads are the vertices and the edges represent the pairwise alignments [2]. The main goal of the overlap graph approach and of its subsequent evolution, namely the string graph [2], is to preserve as much of the read information as possible [2]. However, the read-level graph construction requires an expensive all-vs-all read comparison [2]. The read-level nature implies that a path in such a graph represents a read layout, and a subsequent consensus step must be performed in order to improve the quality of bases called along the path [2]. These graph properties are the foundation of the overlap-layout-consensus (OLC) paradigm [2-4].

A seemingly counterintuitive idea is to fix the overlap length to a given size (k) to build a de Bruijn gra...
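
The introduction excerpt above describes an overlap graph in which reads are vertices, edges are pairwise overlaps, and construction needs an all-vs-all comparison. A minimal sketch of that idea follows. It assumes error-free reads and exact suffix-prefix matches, which is a deliberate simplification: real long-read assemblers use error-tolerant alignment, and this is not WENGAN's or Flye's method. All function names and the toy reads are illustrative.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads: list, min_len: int) -> dict:
    """All-vs-all comparison: vertex = read index, edge weight = overlap length."""
    graph = {i: {} for i in range(len(reads))}
    # Every ordered pair is tested, which is the quadratic cost the text mentions.
    for i, j in permutations(range(len(reads)), 2):
        overlap = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if overlap:
            graph[i][j] = overlap
    return graph


# Toy reads tiling the sequence ACGTACGGATTC.
reads = ["ACGTAC", "TACGGA", "GGATTC"]
graph = build_overlap_graph(reads, min_len=3)
print(graph)  # {0: {1: 3}, 1: {2: 3}, 2: {}}
```

The path 0 → 1 → 2 through this graph is a read layout. A consensus step over the layout would then produce the assembled sequence, which is the overlap-layout-consensus flow the excerpt names.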

