Abstract
Colocalization analysis has emerged as a powerful tool to uncover the overlap of causal variants responsible for both molecular and complex disease phenotypes. The findings from colocalization analysis yield insights into the molecular pathways of complex diseases. In this paper, we conduct an in-depth investigation of the promise and limitations of the available colocalization analysis approaches. Focusing on variant-level colocalization approaches, we first establish the connections between various existing methods. We proceed to discuss the impacts of various controllable analytical factors and uncontrollable practical factors on outcomes of colocalization analysis through realistic simulations and real data examples. We identify a single analytical factor, the specification of prior enrichment levels, which can lead to severe inflation of false-positive colocalization findings. Meanwhile, many other analytical and practical factors, in combination, lead to diminished power. Consequently, we recommend the following strategies as best practice for colocalization analysis: i) estimating the prior enrichment level from the observed data; and ii) separating fine-mapping and colocalization analysis. Our analysis of 4,091 complex traits and the multi-tissue eQTL data from the GTEx (version 8) suggests that colocalizations of molecular QTLs and GWAS traits are widespread in many complex traits. However, only a small proportion can be confidently identified from currently available data due to a lack of power. Our findings should serve as an important benchmark for current and future integrative genetic association analysis applications.
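To make the role of the prior enrichment level concrete, the following is a minimal toy sketch, not the method evaluated in the paper. It assumes a single variant with hypothetical Bayes factors `bf_eqtl` and `bf_gwas`, a hypothetical eQTL prior `p_eqtl`, and a logistic prior in which the log-odds of GWAS causality are `alpha0 + alpha1 * d`, where `d` indicates eQTL causality and `alpha1` plays the role of the enrichment parameter. All names and numeric values are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def coloc_posterior(bf_eqtl, bf_gwas, p_eqtl, alpha0, alpha1):
    """Posterior probability that one variant is causal for both traits.

    Toy model (an assumption, not the paper's model): d and g are binary
    indicators of eQTL and GWAS causality, with
    P(g = 1 | d) = sigmoid(alpha0 + alpha1 * d).
    """
    weights = {}
    for d in (0, 1):
        prior_d = p_eqtl if d else 1.0 - p_eqtl
        p_g = sigmoid(alpha0 + alpha1 * d)
        for g in (0, 1):
            prior_g = p_g if g else 1.0 - p_g
            # Unnormalized posterior weight: prior times Bayes factors.
            weights[(d, g)] = prior_d * prior_g * (bf_eqtl ** d) * (bf_gwas ** g)
    return weights[(1, 1)] / sum(weights.values())


# Same (hypothetical) association evidence, increasingly large assumed enrichment.
results = {
    alpha1: coloc_posterior(
        bf_eqtl=50.0, bf_gwas=20.0, p_eqtl=1e-3, alpha0=-9.0, alpha1=alpha1
    )
    for alpha1 in (0.0, 2.0, 4.0, 6.0)
}

for alpha1, prob in results.items():
    print(f"alpha1 = {alpha1:.1f}  P(colocalized | data) = {prob:.6f}")
```

With the association evidence held fixed, the posterior colocalization probability in this toy model increases with the assumed `alpha1`. This is one way to see why an overstated, user-specified enrichment level could inflate colocalization calls, and why estimating it from the data is recommended.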
Transcriptome-wide association studies (TWAS) and colocalization analysis are popular computational approaches for integrating genetic association data from molecular and complex traits. They show the unique ability to go beyond variant-level genetic association evidence and implicate critical functional units, e.g., genes, in disease etiology. However, in practice, when the two approaches are applied to the same molecular and complex trait data, the inference results can be markedly different. This paper systematically investigates the inferential reproducibility between the two approaches through theoretical derivation, numerical experiments, and analyses of four complex-trait GWAS and GTEx eQTL data. We identify two classes of inconsistent inference results. We find that the first class of inconsistent results may suggest an interesting biological phenomenon, i.e., horizontal pleiotropy; thus, the two approaches are truly complementary. The inconsistency in the second class can be understood and effectively reconciled. To this end, we propose a novel approach for locus-level colocalization analysis. We demonstrate that the joint TWAS and locus-level colocalization analysis improves specificity and sensitivity for implicating biologically relevant genes.
Motivation
Gene set enrichment analysis has been shown to be effective in identifying relevant biological pathways underlying complex diseases. Existing approaches cannot quantify enrichment levels accurately, which prevents the enrichment information from being further utilized in both upstream and downstream analyses. A modernized and rigorous approach to gene set enrichment analysis that emphasizes both hypothesis testing and enrichment estimation is much needed.
Results
We propose a novel computational method, Bayesian Analysis of Gene Set Enrichment (BAGSE), for gene set enrichment analysis. BAGSE is built on a natural Bayesian hierarchical model and fully accounts for the uncertainty embedded in the association evidence of individual genes. We adopt an empirical Bayes inference framework to fit the proposed hierarchical model by implementing an efficient EM algorithm. Through simulation studies, we illustrate that BAGSE yields accurate enrichment quantification while achieving power similar to that of the state-of-the-art methods. Further simulation studies show that BAGSE can effectively utilize the enrichment information to improve the power in gene discovery. Finally, we demonstrate the application of BAGSE in analyzing real data from a differential expression experiment and a transcriptome-wide association study (TWAS). Our results indicate that the proposed statistical framework is effective in aiding the discovery of potentially causal pathways and gene networks.
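The empirical Bayes idea described above can be illustrated with a deliberately simplified sketch. This is not the BAGSE implementation or its exact model. It assumes a two-group mixture for per-gene z-scores (null `N(0, 1)`, associated `N(0, alt_var)` with `alt_var` fixed in advance), a single binary gene-set annotation, and a logistic prior with log-odds `alpha0 + alpha1 * in_set`. The function names, the simulated data, and all parameter values are hypothetical.

```python
import math
import random


def normal_pdf(x, var):
    """Density of a zero-mean normal distribution with variance `var`."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def logit(p):
    return math.log(p / (1.0 - p))


def em_enrichment(z, in_set, alt_var=9.0, n_iter=200):
    """EM estimate of (alpha0, alpha1) in a toy two-group enrichment model."""
    pi = [0.1, 0.1]  # prior association probability outside / inside the set
    for _ in range(n_iter):
        sums, counts = [0.0, 0.0], [0, 0]
        # E-step: posterior probability that each gene is associated.
        for zi, di in zip(z, in_set):
            num = pi[di] * normal_pdf(zi, alt_var)
            den = num + (1.0 - pi[di]) * normal_pdf(zi, 1.0)
            sums[di] += num / den
            counts[di] += 1
        # M-step: with a binary annotation, the update is a per-group mean.
        pi = [min(max(s / c, 1e-6), 1.0 - 1e-6) for s, c in zip(sums, counts)]
    alpha0 = logit(pi[0])
    alpha1 = logit(pi[1]) - logit(pi[0])  # log odds ratio = enrichment
    return alpha0, alpha1


def simulate(n=4000, pi_out=0.05, pi_in=0.30, alt_var=9.0, seed=1):
    """Simulate z-scores in which annotated genes are enriched for signal."""
    rng = random.Random(seed)
    z, in_set = [], []
    for i in range(n):
        d = 1 if i % 4 == 0 else 0
        associated = rng.random() < (pi_in if d else pi_out)
        sd = math.sqrt(alt_var) if associated else 1.0
        z.append(rng.gauss(0.0, sd))
        in_set.append(d)
    return z, in_set


z, in_set = simulate()
alpha0, alpha1 = em_enrichment(z, in_set)
print(f"alpha0 = {alpha0:.2f}, alpha1 (log odds ratio) = {alpha1:.2f}")
```

In this sketch the enrichment level is an estimated quantity (`alpha1`) rather than only a test result, and the fitted prior probabilities could in principle be fed back to re-weight the evidence for individual genes.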
Availability
BAGSE is implemented in C++ and is freely available from https://github.com/xqwen/bagse/. Simulated and real data used in this paper are also available at the GitHub repository for reproducibility purposes.
Supplementary information
Supplementary data are available at Bioinformatics online.