Motivation: Much of the research in genome-wide association studies (GWAS) has searched only for significantly associated signals without explicitly removing unwanted sources of variation. Confounder correction is a necessary step to reveal causal effects, but it is often skipped in summary-based analyses.

Results: We present a novel causal inference algorithm that controls for unwanted sources of variation in genetic variance and covariance estimation tasks. We demonstrate substantially improved statistical power and accuracy in extensive simulations. In real-world applications to the UK Biobank summary statistics data, our method recapitulates well-known pleiotropic modules, suggesting new insights into biobank-scale GWAS analysis.

Contact: YP (ypp@mit.edu) and MK (manoli@mit.edu)

Pleiotropy is pervasive across multiple types of human traits; we no longer expect a single GWAS variant to be fully committed to a single trait. For instance, genetic variants near the PCSK9 gene are widely associated with many different human traits, including lipid metabolism, cardiovascular disorders, type 2 diabetes, and Alzheimer's disease [10]. Pleiotropic patterns can emerge for many reasons. Genetic variants commonly perturb underlying regulatory and metabolic pathways. Or we may observe them simply because definitions of human traits are redundant and elusive. Yet, by knowing the genetic underpinnings of pleiotropic patterns, we can improve our predictions of potential adverse and beneficial side effects of drugs, and even refine definitions of human traits.

Calculating genetic variance and genetic covariance between traits is perhaps the most important first step in multi-trait GWAS analysis, as they directly measure polygenicity and pleiotropy, respectively. By calculating them locally, we improve our resolution. We establish a set of causally associated traits in comparison with many related traits in the biobank, and uncover novel comorbidity networks with confidence in the relevant genomic locations.

b. Problem definition

We focus on estimating these second-order statistics from summary data. Existing summary-based methods, agnostic to the data generation process, are unable to characterize and adjust for biases introduced by non-genetic effects. Most methods inevitably depend on hard-coded assumptions and address only special cases of confounding [13, 32, 38, 39, 45]. However, we are concerned that a substantial proportion of estimated genetic variability may contain contributions from non-genetic effects, such as cryptic relatedness [6]. This concern is even greater for cross-trait analysis on a single cohort, such as the UK Biobank, where samples are inevitably shared and traits are easily confounded by uncharacterized effects.
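To make the second-order statistics in this problem definition concrete, the following minimal sketch computes a naive plug-in estimate of local genetic covariance at a single locus from two z-score vectors and a linkage disequilibrium (LD) matrix. It is not the method proposed in this paper. The function name `local_genetic_cov`, its arguments, and the standardized-effect approximation (marginal effect roughly equal to z divided by the square root of the sample size) are assumptions made for illustration only. The estimate includes no correction for sampling noise, sample overlap, or confounding, which are exactly the biases of concern here.

```python
import numpy as np


def local_genetic_cov(z1, z2, ld, n1, n2, tol=1e-6):
    """Naive plug-in local genetic covariance between two traits at one locus.

    Hypothetical helper for illustration only (not the paper's method).
    z1, z2 : GWAS z-scores of the two traits for the same p variants.
    ld     : p x p LD (correlation) matrix of those variants.
    n1, n2 : GWAS sample sizes of the two traits.
    tol    : relative eigenvalue cutoff used to regularize the LD inverse.
    """
    z1 = np.asarray(z1, dtype=float)
    z2 = np.asarray(z2, dtype=float)
    ld = np.asarray(ld, dtype=float)

    # Assumption: genotypes and phenotypes are standardized, so the
    # marginal effect estimate is approximately z / sqrt(n).
    b1 = z1 / np.sqrt(n1)
    b2 = z2 / np.sqrt(n2)

    # Truncated eigen-decomposition gives a pseudo-inverse of the LD
    # matrix, which is often rank-deficient within a locus.
    w, v = np.linalg.eigh(ld)
    keep = w > tol * w.max()
    ld_pinv = (v[:, keep] / w[keep]) @ v[:, keep].T

    # Joint effects are ld_pinv @ b, so the covariance of the genetic
    # components is b1' ld_pinv b2; passing z1 == z2 and n1 == n2
    # yields the local genetic variance of a single trait.
    return float(b1 @ ld_pinv @ b2)
```

Calling the function with the same z-score vector and sample size for both traits returns a local genetic variance, so one routine covers both quantities. Because the estimate is a plain quadratic form in the z-scores, any non-genetic signal shared across traits enters it directly.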
II. APPROACH

In this work, we present a novel causal inference method, RUV-z (Removing Unwanted Variation in GWAS z-score matrix), with which we characterize undesired sources of information lurking in summary statistics, and selectively remove them to improve the accuracy and statistical power of local variance/covariance calculat...