Sequencing has revolutionized biology by permitting the analysis of genomic variation at an unprecedented resolution. High-throughput sequencing is fast and inexpensive, making it accessible for a wide range of research topics. However, the data produced contain subtle but complex types of errors, biases and uncertainties that pose several statistical and computational challenges to the reliable detection of variants. To tap the full potential of high-throughput sequencing, a thorough understanding of the data produced as well as of the available methodologies is required. Here, I review several commonly used methods for generating and processing next-generation resequencing data, discuss the influence of errors and biases together with their resulting implications for downstream analyses, and provide general guidelines and recommendations for producing high-quality single-nucleotide polymorphism data sets from raw reads by highlighting several sophisticated reference-based methods representing the current state of the art.
INTRODUCTION

Sequencing has entered the scientific zeitgeist and the demand for novel sequences has never been greater, with applications spanning comparative genomics, clinical diagnostics and metagenomics, as well as agricultural, evolutionary, forensic and medical genetic studies. Several hundred species have already been completely sequenced (among them reference genomes for human and numerous major model organisms) (Pagani et al., 2012), shifting the focus of many scientific studies toward resequencing analyses in order to identify and catalog genetic variation in different individuals of a population. These population-scale studies permit insights into natural variation and inference of the demographic and selective history of a population; in addition, variants identified in phenotypically distinct individuals can be used to dissect the relationship between genotype and phenotype.

Until 2005, Sanger sequencing was the dominant technology, but it was prohibitively expensive and time consuming to routinely perform sequencing on the scale required to reach the scientific goals of many modern research projects. For these reasons, several massively parallel, high-throughput 'next-generation' sequencing (NGS) technologies have since been developed, permitting the analysis of genomes and their variation by being hundreds of times faster and over a thousand times cheaper than traditional Sanger sequencing (Metzker, 2010). Whereas the major limiting factor in the Sanger sequencing era was the experimental production of sequence data, data generation using NGS platforms is straightforward, shifting the bottleneck to downstream analyses, with computational costs often surpassing those of data production (Mardis, 2010). In fact, a multitude of bioinformatic algorithms is necessary to efficiently analyze the generated data sets and to answer biologically relevant questions. Thereby, the large amount of data with shorter read lengths, higher per-base error rates