Background
The allo-octoploid F. x ananassa consistently shows disomic inheritance, so diploid variant-calling pipelines can be applied. However, the high sequence similarity among its subgenomes increases the error rate of the called variants. In particular, short sequencing reads (150 bp) can be aligned to the wrong subgenome when mapped to a reference genome, resulting in erroneous variants. Knowing which subgenome contributes to the desired phenotypic value of a trait is important, and filtering out these erroneous variants reduces the chance that the wrong subgenome is traced for that trait. To mitigate the problem, variants first need to be classified into categories: correct variants (type 1) and two erroneous variant types, homoeologous variants (type 2) and multi-locus variants (type 3).
Results
Erroneous variants (types 2 and 3) often, but not always, have a skewed average allele balance (AAB) across heterozygous calls. The AAB of heterozygous calls is therefore not sufficient to tag all erroneous variants in F. x ananassa. Erroneous variants that were not identified by AAB were further checked with an LD-based method in a diversity panel. The variant types predicted by this method showed 99% similarity to those from a validation method that used a genetic map from a biparental mapping population. The effect of the filtering methods on phasing accuracy was assessed by phasing with SHAPEIT5. The lowest switch error rate (0.037) was obtained by combining LD-based and AAB filtering, although adding the latter improved the switch error rate only slightly. This indicates that the LD-based method tags most erroneous variants with a skewed AAB as well as other erroneous variants. The dataset resulting from the best filtering method (LD-based + AAB) had a 44% lower switch error rate than the original dataset and retained 72% of all variants.
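As an illustration of the average allele balance criterion, the following minimal sketch computes the mean alternate-allele fraction over the heterozygous calls of one variant and flags it when that mean deviates from the 0.5 expected under disomic inheritance. The function names, the input layout (genotype string, reference depth, alternate depth) and the tolerance of 0.15 are assumptions made for illustration only; they are not the definitions or thresholds used in this study.

```python
from statistics import mean


def average_allele_balance(calls):
    """Mean alt-allele fraction over the heterozygous calls of one variant.

    `calls` is a list of (genotype, ref_depth, alt_depth) tuples, one per
    sample, with biallelic genotypes such as "0/1" (hypothetical layout).
    Returns None when there is no heterozygous call with coverage.
    """
    balances = []
    for genotype, ref_depth, alt_depth in calls:
        alleles = genotype.replace("|", "/").split("/")
        is_het = len(alleles) == 2 and set(alleles) == {"0", "1"}
        total = ref_depth + alt_depth
        if is_het and total > 0:
            balances.append(alt_depth / total)
    return mean(balances) if balances else None


def is_skewed(aab, tolerance=0.15):
    """Flag a variant whose AAB deviates from 0.5 by more than `tolerance`.

    The tolerance is an illustrative assumption, not a value from the study.
    """
    return aab is not None and abs(aab - 0.5) > tolerance


# Heterozygous calls close to a 1:1 read ratio; the homozygous call is ignored.
balanced = [("0/1", 10, 10), ("0/1", 12, 8), ("0/0", 20, 0)]
# Heterozygous calls with roughly a 3:1 read ratio.
skewed = [("0/1", 30, 10), ("0/1", 15, 5)]

print(is_skewed(average_allele_balance(balanced)))
print(is_skewed(average_allele_balance(skewed)))
```

Because erroneous variants do not always show a skewed balance, a check of this kind can only serve as a first filter, which is why it is combined with the LD-based method.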
Conclusions
Erroneous variants that arise from high sequence similarity among subgenomes in allopolyploids can be identified without genotyping many mapping populations. The LD-based filtering method improved phasing accuracy and makes important alleles easier to trace through the germplasm.