High-quality genotypic data are a requirement for many genetic analyses. For any crop, errors in genotype calls, phasing of markers, linkage maps, pedigree records, and unnoticed variation in ploidy levels can lead to spurious marker-locus-trait associations and incorrect origin assignment of alleles to individuals.
High-throughput genotyping requires automated scoring, as manual inspection of thousands of scored loci is too time-consuming. However, automated SNP scoring can result in errors that should be corrected to ensure recorded genotypic data are accurate and thereby ensure confidence in downstream genetic analyses. To enable quick identification of errors in a large genotypic data set, we have developed a comprehensive workflow. This multiple-step workflow is based on inheritance principles and on removal of markers and individuals that do not follow these principles, as demonstrated here for apple, peach, and sweet cherry. Genotypic data were obtained on pedigreed germplasm using 6-9K SNP arrays for each crop, and a subset of well-performing SNPs was created using ASSIsT. Use of correct (and corrected) pedigree records readily identified violations of simple inheritance principles in the genotypic data, streamlined with FlexQTL™ software. Retained SNPs were grouped into haploblocks to increase the information content of single alleles and reduce the computational power needed in downstream genetic analyses. Haploblock borders were defined by recombination locations detected in ancestral generations of cultivars and selections. Another round of inheritance-checking was conducted for haploblock alleles (i.e., haplotypes). High-quality genotypic data sets were created using this workflow for pedigreed collections representing the U.S. breeding germplasm of apple, peach, and sweet cherry evaluated within the RosBREED project. These data sets contain 3855, 4005, and 1617 SNPs spread over 932, 103, and 196 haploblocks in apple, peach, and sweet cherry, respectively. The highly curated phased SNP and haplotype data sets, as well as the raw iScan data, of germplasm in the apple, peach, and sweet cherry Crop Reference Sets are available through the Genome Database for Rosaceae.

Manual p17 [34]).
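The inheritance-based error check described in the abstract can be illustrated with a minimal sketch. It is not the FlexQTL™ or ASSIsT implementation; it only assumes diploid, biallelic SNP calls written as two-letter strings ("AA", "AB", "BB"), missing calls stored as `None`, and both parents genotyped. All names (`possible_offspring`, `mendelian_errors`, the individuals and markers) are hypothetical.

```python
from itertools import product


def possible_offspring(parent1, parent2):
    """Genotypes obtainable by drawing one allele from each parent."""
    return {"".join(sorted(a + b)) for a, b in product(parent1, parent2)}


def mendelian_errors(genotypes, pedigree):
    """Return (individual, marker) pairs that violate simple inheritance.

    genotypes: dict individual -> dict marker -> call ("AA", "AB", "BB" or None)
    pedigree:  dict individual -> (parent1, parent2)
    """
    errors = []
    for child, (p1, p2) in pedigree.items():
        for marker, call in genotypes[child].items():
            g1 = genotypes[p1].get(marker)
            g2 = genotypes[p2].get(marker)
            # Skip trios with any missing call: nothing can be concluded.
            if call is None or g1 is None or g2 is None:
                continue
            if "".join(sorted(call)) not in possible_offspring(g1, g2):
                errors.append((child, marker))
    return errors


# Hypothetical toy data: two parents and two offspring, three SNPs.
genotypes = {
    "P1": {"snp1": "AA", "snp2": "AB", "snp3": "AA"},
    "P2": {"snp1": "BB", "snp2": "AB", "snp3": "AA"},
    "C1": {"snp1": "AB", "snp2": "BB", "snp3": "AB"},
    "C2": {"snp1": "AA", "snp2": None, "snp3": "AA"},
}
pedigree = {"C1": ("P1", "P2"), "C2": ("P1", "P2")}

errors = mendelian_errors(genotypes, pedigree)
print(errors)
```

In this toy data set, `C1` carries a `B` allele at `snp3` that neither parent has, and `C2` is homozygous `AA` at `snp1` although one parent is `BB`. In a real data set, such a conflict may reflect a genotype calling error, a null allele, or an incorrect pedigree record, so counting the conflicts per marker and per individual is one way to decide which of the two to inspect first.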
The presence of one or more additional SNPs, insertions, or deletions in the probe-binding region can lead to reduced or lost binding affinity for the SNP's probe and thereby to the presence of additional clusters, both of which can lead to incorrect genotype scoring of some SNPs [33].

No systematic workflow exists to efficiently detect and resolve all types of errors in a genotypic data set for pedigreed germplasm. Methods and software exist to tackle specific types of errors. For example, the ASSIsT software was developed for use with Illumina Infinium® arrays to identify which SNPs show robust results, which SNPs might have genotype calling errors due to alleles with reduced affinity or null alleles, and which...