Abstract
Heterogeneity in the definition and measurement of complex diseases in Genome-Wide Association Studies (GWAS) may lead to misdiagnoses and misclassification errors that can significantly impact discovery of disease loci. Although this problem is well appreciated, almost all analyses of GWAS data take reported disease phenotype values as is, without accounting for potential misclassification. Here, we introduce Phenotype Latent variable Extraction of disease misdiagnosis (PheLEx), a GWAS analysis framework that learns and corrects misclassified phenotypes using structured genotype associations within a dataset. PheLEx consists of a hierarchical Bayesian latent variable model, in which inference of differential misclassification is accomplished using filtered genotypes while implementing a full mixed model to account for population structure and genetic relatedness in study populations. Through simulations, we show that, for realistic allele effect sizes, the PheLEx framework dramatically improves recovery of the correct disease state compared to existing methodologies designed for Bayesian recovery of disease phenotypes. We also demonstrate the potential of PheLEx for extracting new candidate loci from existing GWAS data by analyzing epilepsy and bipolar disorder phenotypes available from the UK Biobank dataset. In these analyses, we identify new candidate disease loci not previously reported for these datasets that have biological connections to the disease phenotypes and/or were identified in independent GWAS. In the discussion, we consider the broader consequences and the importance of careful interpretation of misclassification correction in GWAS phenotypes, as well as the potential of PheLEx for re-analyzing existing GWAS data to make novel discoveries.

Author Summary
Widespread misdiagnosis of diseases, caused by limited understanding and/or the lack of gold-standard diagnostic measures, can affect any downstream analysis.
These misdiagnosis errors are especially significant for psychiatric or psychological disorders, where disease definitions and diagnostic tools are continually being revised. Here, we propose a method to identify misdiagnosed individuals and infer the correct disease phenotype. We examined the performance of this method on rigorous simulations and on real disease phenotypes obtained from the UK Biobank database. Using a carefully designed hierarchical Bayesian latent variable model, the method successfully recovered misdiagnosed individuals in simulations. For two real disease phenotypes, epilepsy and bipolar disorder, the method not only suggested an alternate phenotype, but its results were also used to identify genomic loci that are new to these datasets and have previously been shown to be associated with the respective phenotypes. This suggests that the method can be used to reanalyze large-scale genetic datasets and discover novel loci that traditional methodologies might miss.
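
To make concrete why misclassified phenotypes can hide disease loci, the following minimal sketch computes how the allelic odds ratio at a single variant shrinks toward 1 when some individuals carry the wrong case/control label. It is an illustration only and is not part of PheLEx: the function names, allele frequencies, prevalence, and error rates are all hypothetical values chosen for the example, and misclassification is assumed to be independent of genotype.

```python
def observed_allele_freqs(p_case, p_ctrl, prevalence, fp_rate, fn_rate):
    """Risk-allele frequency among *labelled* cases and controls.

    All arguments are illustrative assumptions, not values from the paper:
      p_case, p_ctrl : allele frequency in true cases / true controls
      prevalence     : fraction of the sample that is truly a case
      fp_rate        : P(labelled case | true control)
      fn_rate        : P(labelled control | true case)
    """
    # Composition of the labelled-case group.
    true_pos = prevalence * (1 - fn_rate)
    false_pos = (1 - prevalence) * fp_rate
    # Composition of the labelled-control group.
    false_neg = prevalence * fn_rate
    true_neg = (1 - prevalence) * (1 - fp_rate)

    # Each labelled group is a mixture of true cases and true controls.
    f_case = (true_pos * p_case + false_pos * p_ctrl) / (true_pos + false_pos)
    f_ctrl = (false_neg * p_case + true_neg * p_ctrl) / (false_neg + true_neg)
    return f_case, f_ctrl


def odds_ratio(f_case, f_ctrl):
    """Allelic odds ratio between two groups."""
    return (f_case / (1 - f_case)) / (f_ctrl / (1 - f_ctrl))


# Hypothetical variant: allele frequency 0.30 in true cases, 0.25 in true controls.
or_true = odds_ratio(*observed_allele_freqs(0.30, 0.25, 0.5, 0.0, 0.0))
# Same variant with 10% of labels wrong in each direction.
or_obs = odds_ratio(*observed_allele_freqs(0.30, 0.25, 0.5, 0.1, 0.1))

print(f"odds ratio with correct labels: {or_true:.2f}")
print(f"odds ratio with 10% mislabels:  {or_obs:.2f}")
```

With these assumed numbers, the odds ratio drops from roughly 1.29 to roughly 1.22. An attenuated effect of this kind requires a larger sample to reach the same significance threshold, which is the sense in which misclassification can cause true loci to be missed.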