Next-generation sequencing (NGS) is a key technology in understanding the causes and consequences of human genetic variability. In this context, the validity of NGS-inferred single-nucleotide variants (SNVs) is of paramount importance. We therefore developed a statistical framework to assess the fidelity of three common NGS platforms and to estimate the proportion of false-positive heterozygotes based on read distributions. Application of this framework to aligned DNA sequence data from two completely sequenced HapMap samples included in the 1000 Genomes Project revealed remarkably different error profiles for the three platforms. Newly identified SNVs showed consistently higher proportions of false positives (3-17%) than confirmed HapMap variants. We show that this increase was not due to differences in flanking sequence features, read coverage or quality, nor was it limited to a particular data set or variant calling algorithm. Consensus calling by more than one platform yielded significantly lower error rates (1-4%). This implies that the use of multiple NGS platforms may be more cost-efficient than relying upon a single technology alone, particularly in physically localized sequencing experiments that require low error rates. Our study thus highlights that different NGS platforms are suited to different practical applications.
Resequencing studies are still more expensive than GWAS on a per-subject basis for accurately calling individual-level genotypes. As a result, subsets of subjects from a larger study often serve as the resequencing sample. To investigate the entire genome, the choice of subjects should maximize the number of ancestral lineages, avoiding redundant regions that were inherited identical by descent (IBD) from a common ancestor.
We present SampleSeq2 (SS2), a greedy algorithm that can select a subset of optimally unrelated subjects, estimate the number of independent chromosomes, G_T, or select the minimum number of subjects needed to reach a target G_T. We evaluated SS2 against random selection, both by simulation and using data from the Amish study of Successful Aging. In simulations, the estimate of G_T was close to the known true value, and SS2 increased G_T relative to a random draw across a range of sample sizes. The full Amish pedigree comprised 4995 subjects, of whom 827 were in the aging study. We compared SS2 with random selection of K subjects (K=50, 100). For K=50, the average G_T was 41.5 using SS2 and 29.7 for random selection; on average, SS2 resulted in 39% more independent genomes. For K=100, the average G_T was 60.6 for SS2 and 39.9 for random selection, or 52% more independent genomes. Increasing the number of independent chromosomes provides a no-cost improvement in power, mitigates the effects of relatedness on parameter estimates, and increases the yield of alleles from resequencing.
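The abstract does not state the criterion that SS2 optimizes, so the following is only a minimal sketch of the general idea of greedy selection of unrelated subjects, not the SS2 implementation. It assumes a precomputed pairwise kinship matrix is available and, as a stand-in objective, adds at each step the subject with the smallest summed kinship to those already chosen. The function name, the toy matrix and the tie-breaking rule are all hypothetical.

```python
from typing import List, Sequence


def greedy_unrelated_subset(kinship: Sequence[Sequence[float]], k: int) -> List[int]:
    """Greedily pick k subjects with low mutual kinship.

    Assumption (not from the abstract): the objective is the summed
    pairwise kinship between a candidate and the subjects already chosen.
    """
    n = len(kinship)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of subjects")

    # Seed with the subject least related to everyone else overall.
    seed = min(
        range(n),
        key=lambda i: sum(kinship[i][j] for j in range(n) if j != i),
    )
    selected = [seed]

    # load[i] = summed kinship between subject i and the selected set.
    load = [kinship[i][seed] for i in range(n)]

    while len(selected) < k:
        chosen = set(selected)
        # Ties are broken by lowest index, because min() keeps the first minimum.
        best = min((i for i in range(n) if i not in chosen), key=lambda i: load[i])
        selected.append(best)
        for i in range(n):
            load[i] += kinship[i][best]

    return selected


# Toy example: subjects 0 and 1 are full siblings, 2 and 3 are full
# siblings, and subject 4 is unrelated to everyone (self-kinship 0.5).
toy_kinship = [
    [0.50, 0.25, 0.00, 0.00, 0.00],
    [0.25, 0.50, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.50, 0.25, 0.00],
    [0.00, 0.00, 0.25, 0.50, 0.00],
    [0.00, 0.00, 0.00, 0.00, 0.50],
]
picked = greedy_unrelated_subset(toy_kinship, 3)
print(picked)
```

In the toy pedigree the sketch picks the unrelated subject first and then one member of each sibling pair, so no two selected subjects share ancestry. A random draw of three subjects would often include both siblings from one pair, which is the kind of redundancy the abstract describes.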
On Study Designs For Identification Of Rare Disease Variants In Complex Diseases
Iuliana Ionita-Laza (1), Ruth Ottman (1)
(1) Columbia University
The recent progress ...