Evidence from both GWAS and clinical observation has suggested that certain psychiatric, metabolic, and autoimmune diseases are heterogeneous, comprising multiple subtypes with distinct genomic etiologies and Polygenic Risk Scores (PRS). However, the presence of subtypes within many phenotypes is frequently unknown. We present CLiP (Correlated Liability Predictors), a method to detect heterogeneity in single GWAS cohorts. CLiP calculates a weighted sum of correlations between SNPs contributing to a PRS on the case/control liability scale. We demonstrate mathematically and through simulation that among i.i.d. homogeneous cases generated by a liability threshold model, significant anti-correlations are expected between otherwise independent predictors due to ascertainment on the hidden liability score. In the presence of heterogeneity from distinct etiologies, confounding by covariates, or mislabeling, these correlation patterns are altered predictably. We further extend our method to two additional association study designs: CLiP-X for quantitative predictors in applications such as transcriptome-wide association, and CLiP-Y for quantitative phenotypes, where there is no clear distinction between cases and controls. Through simulations, we demonstrate that CLiP and its extensions reliably distinguish between homogeneous and heterogeneous cohorts when the PRS explains as low as 3% of variance on the liability scale and cohorts comprise 50,000 to 100,000 samples, an increasingly practical size for modern GWAS. We apply CLiP to heterogeneity detection in schizophrenia cohorts totaling >50,000 cases and controls collected by the Psychiatric Genomics Consortium. We observe significant heterogeneity in mega-analysis of the combined PGC data (p-value 8.54 × 10⁻⁴), as well as in individual cohorts meta-analyzed using Fisher's method (p-value 0.03), based on significantly associated variants.
We also apply CLiP-Y to detect heterogeneity in neuroticism in over 10,000 individuals from the UK Biobank and detect heterogeneity with a p-value of 1.68 × 10⁻⁹. Scores were not significantly reduced when partitioning by known subclusters ("Depression" and "Worry"), suggesting that these factors are not the primary source of observed heterogeneity.
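The ascertainment effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes ten independent SNPs with equal effects, uses an unweighted mean of pairwise Pearson correlations in place of the weighted CLiP score, and models heterogeneity only as mislabeling (half of the "cases" are actually controls). All function names and parameter values are hypothetical choices made for illustration.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_pairwise_corr(genotypes):
    """Unweighted mean correlation over all SNP pairs.

    This is a simplification: CLiP uses a weighted sum, and the
    weights are not specified in the abstract.
    """
    cols = list(zip(*genotypes))
    pairs = list(combinations(range(len(cols)), 2))
    return sum(pearson(cols[i], cols[j]) for i, j in pairs) / len(pairs)


def simulate(n_pop=60_000, n_snps=10, maf=0.5, h2=0.5, prevalence=0.05, seed=1):
    """Draw a population under a liability threshold model.

    Returns (cases, controls), each a list of genotype vectors.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Equal effect sizes, scaled so the SNPs jointly explain h2 of a
    # liability with total variance 1.
    beta = math.sqrt(h2 / (n_snps * 2 * maf * (1 - maf)))
    noise_sd = math.sqrt(1 - h2)
    people = []
    for _ in range(n_pop):
        # Each genotype is the sum of two independent allele draws.
        g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n_snps)]
        liability = beta * sum(g) + rng.gauss(0, noise_sd)
        people.append((liability, g))
    # Cases are the top `prevalence` fraction of liabilities
    # (an empirical threshold).
    people.sort(key=lambda p: p[0], reverse=True)
    n_cases = int(prevalence * n_pop)
    cases = [g for _, g in people[:n_cases]]
    controls = [g for _, g in people[n_cases:]]
    return cases, controls


cases, controls = simulate()

# Homogeneous cohort: every "case" truly exceeds the liability threshold.
homog = mean_pairwise_corr(cases)

# Heterogeneous cohort: half true cases, half mislabeled controls.
rng = random.Random(2)
half = len(cases) // 2
mixed = mean_pairwise_corr(rng.sample(cases, half) + rng.sample(controls, half))

print(f"homogeneous cases:      mean SNP-SNP correlation = {homog:+.4f}")
print(f"50% mislabeled 'cases': mean SNP-SNP correlation = {mixed:+.4f}")
```

Under these assumptions the homogeneous cases should show a negative mean correlation, because selecting on high liability makes a high count at one SNP less necessary given a high count at another. Mixing in mislabeled controls shifts the score upward, since the two subgroups differ in mean genotype at every SNP.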
One of the most exciting applications of modern artificial intelligence is to automatically discover scientific laws from experimental data. This is not a trivial problem, as it involves searching for a complex mathematical relationship over a large set of explanatory variables and operators that can be combined in an infinite number of ways. Inspired by the success of deep learning in computer vision, we tackle this problem by adapting various successful network architectures into the symbolic law discovery pipeline. The novelty of our approach is in (1) encoding the input data as an image with super-resolution, (2) developing an appropriate deep network pipeline, and (3) predicting the importance of each mathematical operator from the relationship image. This allows us to place a prior on the exponentially large search using the predicted importance of the symbolic operators, which can significantly accelerate the discovery process. We apply our model to a variety of plausible relationships (both simulated and drawn from physics and mathematics) involving different dimensions and constituents. We show that our model is able to identify the underlying operators from data, achieving high accuracy and AUC (91% and 0.96 on average, respectively) for systems with as many as ten independent variables. Our method significantly outperforms the current state of the art in terms of data fitting (R^2), discovery rate (recovering the true relationship), and succinctness (output formula complexity). The discovered equations can be seen as first drafts of scientific laws that can help scientists with (1) hypothesis building and (2) understanding the complex underlying structure of the studied phenomena. Our approach holds real promise for speeding up the rate of scientific discovery.
Evidence from both GWAS and clinical observation has suggested that certain psychiatric, metabolic, and autoimmune diseases are heterogeneous, comprising multiple subtypes with distinct genomic etiologies and Polygenic Risk Scores (PRS). However, the presence of subtypes within many phenotypes is frequently unknown. We present CLiP (Correlated Liability Predictors), a method to detect heterogeneity in single GWAS cohorts. CLiP calculates a weighted sum of correlations between SNPs contributing to a PRS on the case/control liability scale. We demonstrate mathematically and through simulation that among i.i.d. homogeneous cases, significant anti-correlations are expected between otherwise independent predictors due to ascertainment on the hidden liability score. In the presence of heterogeneity from distinct etiologies, confounding by covariates, or mislabeling, these correlation patterns are altered predictably. We further extend our method to two additional association study designs: CLiP-X for quantitative predictors in applications such as transcriptome-wide association, and CLiP-Y for quantitative phenotypes, where there is no clear distinction between cases and controls. Through simulations, we demonstrate that CLiP and its extensions reliably distinguish between homogeneous and heterogeneous cohorts when the PRS explains as low as 5% of variance on the liability scale and cohorts comprise 50,000 to 100,000 samples, an increasingly practical size for modern GWAS. We apply CLiP to heterogeneity detection in schizophrenia cohorts totaling >50,000 cases and controls collected by the Psychiatric Genomics Consortium. We observe significant heterogeneity in mega-analysis of the combined PGC data (p-value 8.54e-4), as well as in individual cohorts meta-analyzed using Fisher's method (p-value 0.03), based on significantly associated variants.
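Fisher's method, used above to combine per-cohort results, sums the log p-values into a chi-square statistic with 2k degrees of freedom for k independent tests. The following is a minimal sketch of that standard calculation, not the authors' code; the function name and the example p-values are made up for illustration and are not PGC results.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). The statistic -2 * sum(ln p) is
    chi-square distributed with 2k degrees of freedom under the joint
    null. For even degrees of freedom the survival function has the
    closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!, so no external
    library is needed.
    """
    k = len(pvalues)
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return stat, math.exp(-half) * total


# Hypothetical per-cohort heterogeneity p-values (illustrative only).
example_pvalues = [0.20, 0.04, 0.51, 0.09]
stat, combined = fisher_combined_pvalue(example_pvalues)
print(f"chi-square statistic = {stat:.3f}, combined p-value = {combined:.4f}")
```

The method assumes the cohort-level tests are independent; overlapping samples between cohorts would violate that assumption.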
1 Introduction
In recent years, Genome-Wide Association Studies (GWAS) have identified thousands of genomic risk factors and generated insights into disease etiologies and potential treatments [1, 2, 3]. Increasingly, there has been interest in advancing beyond these associations towards obtaining a deeper understanding of the mechanisms by which genomic factors influence disease [1, 4]. These require models beyond simply combining linear effects of variants, as they often modulate phenotypes indirectly, through the expression of other genes [5, 6]. One such avenue has concerned the apparent heterogeneity of diseases, which has not been sufficiently recognized by GWAS: while individuals in cohorts for these studies are frequently classified simply as cases or controls, clinical evidence for several GWAS traits has suggested that there are multiple different subtypes of diseases consisting of distinct sets of symptoms and association with distinct rare risk alleles [7, 8].
For example, polygenic risk scores for major depressive disorder explain more of the phenotypic variance when ca...