Low-coverage next-generation sequencing methodologies are routinely employed to genotype large populations. Missing data in these populations manifest both as missing markers and as markers with incomplete allele recovery. False homozygous calls at heterozygous sites resulting from incomplete allele recovery confound many existing imputation algorithms. These types of systematic errors can be minimized by incorporating depth-of-sequencing read coverage into the imputation algorithm. Accordingly, we developed Low-Coverage Biallelic Impute (LB-Impute) to resolve missing data issues. LB-Impute uses a hidden Markov model that incorporates marker read coverage to determine variable emission probabilities. Robust, highly accurate imputation results were reliably obtained with LB-Impute, even at extremely low (<1×) average per-marker coverage. This finding will have implications for the design of genotype imputation algorithms in the future. LB-Impute is publicly available on GitHub at https://github.com/dellaportalaboratory/LB-Impute.

KEYWORDS hidden Markov models; imputation; next-generation sequencing; population genetics; plant genomics

The imputation of missing genotype data has been a key research topic in statistical genetics since well before the advent of next-generation sequencing (NGS) technologies. The goal of many of these algorithms was to reconstruct haplotypes from Sanger- or microarray-based genotyping, usually in human populations. Strategies employing the expectation-maximization algorithm (Hawley and Kidd 1995; Long et al. 1995; Qin et al. 2002; Scheet and Stephens 2006), Bayesian inference (Stephens and Donnelly 2003), or Markovian methodology (Stephens et al. 2001; Broman et al. 2003; Broman and Sen 2009) to infer local ancestry and gametic phase could be used to resolve missing markers within a population (Browning and Browning 2011). In these cases, missing genotypes were assigned based on the most likely proximal haplotypes. These computational methods greatly increased the information content of genotyping data, especially for population studies (Spencer et al. 2009; Cleveland et al. 2011). While these programs were powerful and accurate, they could also be computationally expensive. Further, they assumed that available genotypes were largely correct, which could cause issues with sequencing data sets.

The development of programs that focused primarily on the imputation of missing data and haplotype phasing was likely motivated by several factors. Genome-wide association studies could be enhanced by the inference of additional markers using large multipopulation data sets such as the International HapMap Project (International HapMap Consortium et al. 2010). The emergence of meta-analyses led to a need for algorithms that could merge disparate data sets (Howie et al. 2009; Li et al. 2010; Liu et al. 2013; Fuchsberger et al. 2015). These algorithms often employed large haplotype reference panels to improve imputation (Marchini et al. 2007; Browning and Browning 2009; Howie et al. 2009). In bialleli...