The next-generation sequencing (NGS) revolution has drastically reduced time and cost requirements for sequencing of large genomes, and also qualitatively changed the problem of assembly. This article reviews the state of the art in de novo genome assembly, paying particular attention to mammalian-sized genomes. The strengths and weaknesses of the main sequencing platforms are highlighted, leading to a discussion of assembly and the new challenges associated with NGS data. Current approaches to assembly are outlined and the various software packages available are introduced and compared. The question of whether quality assemblies can be produced using short-read NGS data alone, or whether it must be combined with more expensive sequencing techniques, is considered. Prospects for future assemblers and tests of assembly performance are also discussed.
Keywords: de novo assembly; genomics; next-generation sequencing; whole-genome shotgun
Genome assembly continues to be one of the central problems of bioinformatics. This is owing, in large part, to the continuing development of the sequencing technology that provides 'reads' of short sequences of DNA, from which the genome is inferred. Larger sets of data, and changes in the properties of reads such as length and error rate, bring with them new challenges for assembly. For the earliest sequencing efforts using the whole-genome shotgun (WGS) approach, in which reads are generated from random locations across the entire genome, assembly could be dealt with by arranging print-outs of the reads by hand. Through the next three decades, Sanger capillary sequencing gained substantially in throughput, and WGS became practical for increasingly large and complex genomes, from tens of kilobases in the early 1980s to gigabases by 2001 [1]. In line with this, assembly came to rely not only on increasingly powerful computational resources, but also on increasingly time- and memory-efficient assemblers.
A further revolution in sequencing began around 2005, when second-generation sequencing (SGS) technologies began to produce massive throughput at far lower cost than Sanger sequencing, enabling a mammalian genome to be sequenced in a matter of days [2]. De novo assemblies of the panda [3] and turkey [4] genomes have now been made using SGS data alone, and several human resequencing projects have been completed [5][6][7].
© The Wellcome Trust Sanger Institute. * Author for correspondence: Tel.: +44 1223 494705, Fax: +44 1223 494919, zn1@sanger.ac.uk.
Financial & competing interests disclosure: The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed. No writing assistance was utilized in the production of this manuscript.
Assembly is not at all a trivial task. Repeated sequences of DNA make it difficult to infer the relative positions in the genome corresponding to reads, and they occur far more often...