The continuous improvement of long-read sequencing technologies, along with the development of ad hoc algorithms, has launched a new de novo assembly era that promises high-quality genomes. However, it has proven difficult to use only long reads to generate accurate genome assemblies of large, repeat-rich human genomes. To date, most human genomes assembled from long error-prone reads add accurate short reads to further polish the consensus quality. Here, we report the development of a novel algorithm for hybrid assembly, WENGAN, and the de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms that exploit the sequence information of short and long reads to tackle assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50 up to 62.06 Mb), few assembly errors (contig NGA50: 10.9-45.91 Mb), good consensus quality (QV: 27.79-33.61), and high gene completeness (BUSCO complete: 94.6-95.1%), while consuming low computational resources (CPU hours: 153-1027). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 62.06 Mb (NGA50: 45.91 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb). Providing the highest quality at low computational cost, WENGAN is an important step towards the democratization of the de novo assembly of human genomes. The WENGAN assembler is available at https://github.com/adigenova/wengan

1 Introduction

Genome assembly is the process by which an unknown genome sequence is constructed by detecting overlaps between a set of redundant genomic reads. Most genome assemblers represent the overlap information using different kinds of assembly graphs [1,2].
The main idea behind these algorithms is to reduce the genome assembly problem to a path problem where the genome is reconstructed by finding "the" true genome path in a tangled assembly graph [1,2]. The tangledness comes from the complexity that repetitive genomic regions induce in the assembly graphs [1,2]. The first graph-based genome assemblers used overlaps of variable length to construct an overlap graph [2]. In such a graph, the reads are the vertices and the edges represent the pairwise alignments [2]. The main goal of the overlap graph approach and of its subsequent evolution, namely the string graph [2], is to preserve as much of the read information as possible [2]. However, the read-level graph construction requires an expensive all-vs-all read comparison [2]. The read-level nature implies that a path in such a graph represents a read layout, and a subsequent consensus step must be performed in order to improve the quality of the bases called along the path [2]. These graph properties are the foundation of the overlap-layout-consensus (OLC) paradigm [2][3][4].
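To make the overlap graph concrete, the following minimal Python sketch builds one from three toy reads and spells the sequence along a path (the layout). It is an illustration under strong simplifying assumptions, not the method of any particular assembler: overlaps are exact suffix-prefix matches rather than error-tolerant pairwise alignments, the all-vs-all comparison is done naively, and the consensus step is omitted. The function names `suffix_prefix_overlap`, `build_overlap_graph` and `spell_layout`, as well as the `min_len` parameter, are hypothetical names chosen for this example.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Return the length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_len=3):
    """Naive all-vs-all comparison: reads are vertices, overlaps are edges.

    The result maps an ordered pair of read indices (i, j) to the overlap
    length between the end of read i and the start of read j.
    """
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        overlap = suffix_prefix_overlap(reads[i], reads[j], min_len)
        if overlap:
            edges[(i, j)] = overlap
    return edges


def spell_layout(reads, path, edges):
    """Spell the sequence of a path, skipping the overlapping prefix of each read."""
    sequence = reads[path[0]]
    for u, v in zip(path, path[1:]):
        sequence += reads[v][edges[(u, v)]:]
    return sequence


# Toy error-free reads (an assumption; real long reads contain errors).
reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
edges = build_overlap_graph(reads, min_len=3)
layout = spell_layout(reads, [0, 1, 2], edges)
print(edges)
print(layout)
```

The nested comparison over all ordered read pairs is what makes the read-level construction expensive: the number of comparisons grows quadratically with the number of reads, which is why practical assemblers rely on indexing or sketching to avoid testing every pair.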
A seemingly counterintuitive idea is to fix the overlap length to a given size (k) to build a de Bruijn gra...