Although the technical and analytic complexity of whole genome sequencing is
generally appreciated, best practices for data cleaning and quality control have not
been defined. Family-based data can be used to guide the standardization of specific
quality control metrics in non-family-based data. Because the mutation rate is low,
Mendelian inheritance errors most likely result from erroneous genotype calls.
Thus, our goal was to identify the characteristics that determine Mendelian
inheritance errors. To accomplish this, we used family-based whole genome sequencing
data for chromosome 3 from the Genetic Analysis Workshop 18 (GAW18). Mendelian
inheritance errors were provided as part of the GAW18 data set. Additionally, for
binary variants, we calculated Mendelian inheritance errors using PLINK. Based on our analysis, nonbinary
single-nucleotide variants have an inherently high number of Mendelian inheritance
errors. Furthermore, in binary variants, Mendelian inheritance errors are not
randomly distributed. Indeed, we identified three Mendelian inheritance error peaks that
were enriched with repetitive elements. However, these peaks can be reduced by
applying a single filter from the sequencing file. In summary, we demonstrated
that erroneous sequencing calls are nonrandomly distributed across the genome and
that quality control metrics can dramatically reduce the number of Mendelian inheritance
errors. Appropriate quality control will allow optimal use of genetic data to realize
the full potential of whole genome sequencing.
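
To make the notion of a Mendelian inheritance error concrete, the following is a minimal sketch of the check for a single parent-offspring trio at a binary (two-allele) autosomal variant. It is an illustration only, not the PLINK implementation or the procedure used to produce the GAW18 error calls. The genotype encoding (count of alternate alleles: 0, 1, or 2) and the function names `alleles`, `is_mendelian_error`, and `count_errors` are assumptions introduced here, and missing genotypes are not handled.

```python
from itertools import product


def alleles(genotype):
    """Map an alternate-allele count (0, 1, or 2) to its two alleles.

    Assumed encoding: 0 = reference allele, 1 = alternate allele.
    """
    return {0: (0, 0), 1: (0, 1), 2: (1, 1)}[genotype]


def is_mendelian_error(father, mother, child):
    """Return True if the child's genotype cannot be formed by taking
    one allele from the father and one from the mother."""
    possible = {pa + ma for pa, ma in product(alleles(father), alleles(mother))}
    return child not in possible


def count_errors(trios):
    """Count Mendelian errors over an iterable of (father, mother, child)
    genotype triples, for example all trios at one variant."""
    return sum(is_mendelian_error(f, m, c) for f, m, c in trios)


if __name__ == "__main__":
    # Hypothetical genotypes for three trios at one variant.
    example = [(0, 0, 1), (0, 1, 1), (2, 2, 1)]
    print(count_errors(example))
```

Counting such errors per variant and then examining how the counts vary along the chromosome is one simple way to look for the kind of nonrandom clustering described above.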