Polygenic scores (PGS) have been widely used to predict disease risk using variants identified from genome-wide association studies (GWAS). To date, most GWAS have been conducted in populations of European ancestry, which limits the use of GWAS-derived PGS in non-European ancestry populations. Here, we derive a theoretical model of the relative accuracy (RA) of PGS across ancestries. We show through extensive simulations that the RA of PGS based on genome-wide significant SNPs can be predicted accurately from modelling linkage disequilibrium (LD), minor allele frequencies (MAF), cross-population correlations of causal SNP effects and heritability. We find that LD and MAF differences between ancestries can explain between 70 and 80% of the loss of RA of European-based PGS in African ancestry for traits like body mass index and type 2 diabetes. Our results suggest that causal variants underlying common genetic variation identified in European ancestry GWAS are mostly shared across continents.
Genetic correlation is a key population parameter that describes the shared genetic architecture of complex traits and diseases. It can be estimated by current state-of-the-art methods, i.e., linkage disequilibrium score regression (LDSC) and genomic restricted maximum likelihood (GREML). The massively reduced computing burden of LDSC compared to GREML makes it an attractive tool, although the accuracy (i.e., magnitude of standard errors) of LDSC estimates has not been thoroughly studied. In simulation, we show that the accuracy of GREML is generally higher than that of LDSC. When there is genetic heterogeneity between the actual sample and the reference data from which LD scores are estimated, the accuracy of LDSC decreases further. In real data analyses estimating the genetic correlation between schizophrenia (SCZ) and body mass index, we show that GREML estimates based on ∼150,000 individuals give a higher accuracy than LDSC estimates based on ∼400,000 individuals (from combined meta-data). A GREML genomic partitioning analysis reveals that the genetic correlation between SCZ and height is significantly negative for regulatory regions, which whole-genome or LDSC approaches have less power to detect. We conclude that LDSC estimates should be carefully interpreted as there can be uncertainty about homogeneity among combined meta-datasets. We suggest that any interesting findings from massive LDSC analysis for a large number of complex traits should be followed up, where possible, with more detailed analyses with GREML methods, even if sample sizes are smaller.
The genomics era has brought useful tools to dissect the genetic architecture of complex traits. We propose a reaction norm model (RNM) to tackle genotype-environment correlation and interaction problems in the context of genome-wide association analyses of complex traits. In our approach, an environmental risk factor affecting the trait of interest can be modeled as dependent on a continuous covariate that is itself regulated by genetic as well as environmental factors. Our multivariate RNM approach allows the joint modelling of the relation between the genotype (G) and the covariate (C), so that both their correlation (association) and interaction (effect modification) can be estimated. Hence we jointly estimate genotype-covariate correlation and interaction (GCCI). We demonstrate using simulation that the proposed multivariate RNM performs better than the current state-of-the-art methods that ignore G-C correlation. We apply the method to data from the UK Biobank (N = 66,281) in an analysis of body mass index using smoking quantity as a covariate. We find a highly significant G-C correlation, but a negligible G-C interaction. In contrast, when a conventional G-C interaction analysis is applied (i.e., G-C correlation is not included in the model), highly significant G-C interaction estimates are found. It is also notable that we find significant heterogeneity in the estimated residual variances across different covariate levels, probably due to residual-covariate interaction. Using simulation we also show that the residual variances estimated by genomic restricted maximum likelihood (GREML) or linkage disequilibrium score regression (LDSC) can be inflated in the presence of interactions, implying that the currently reported SNP-heritability estimates from these methods should be interpreted with caution.
We conclude that it is essential to correctly account for both interaction and correlation in complex trait analyses and that the failure to do so may lead to substantial biases in inferences relating to genetic architecture of complex traits, including estimated SNP-heritability.
Polygenic scores (PGSs), which assess the genetic risk of individuals for a disease, are calculated as a weighted count of risk alleles identified in genome-wide association studies (GWASs). PGS methods differ in terms of which DNA variants are included in the score and the weights assigned to them. PGSs are evaluated in independent target samples of individuals with known disease status. Evaluations of new PGS methods are usually made using simulated data or a single target cohort; however, in real data sets there can be heterogeneity between target sample cohorts, which could reflect a number of real or artefactual factors. The Psychiatric Genomics Consortium working groups for schizophrenia (SCZ) and major depressive disorder (MDD) bring together many independently collected case-control cohorts for GWAS meta-analysis. These resources are used here in repeated application of leave-one-cohort-out GWAS analyses, generating robust conclusions for PGS prediction applied across multiple target (left-out) cohorts. Eight PGS methods (P+T, SBLUP, LDpred-Inf, LDpred-funct, LDpred, PRS-CS, PRS-CS-auto, SBayesR) are compared. We found that SBayesR had the highest prediction evaluation statistics in most comparisons. For SCZ across 30 target cohorts, the SBayesR PGS achieved a mean area under the receiver operating characteristic curve (AUC) of 0.733, and explained 9.9% of variance on the liability scale. For MDD across 26 target cohorts, the AUC and variance explained were 0.601 and 4.0%, respectively. The variance explained by the SBayesR PGS was 46% and 43% higher for SCZ and MDD, respectively, compared to the basic p-value thresholding P+T method.
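The "weighted count of risk alleles" and the AUC evaluation described above can be illustrated with a minimal sketch. This is not the code of any of the eight methods compared: the function names `polygenic_score` and `auc`, the toy genotypes and the weights are all hypothetical, and the weights stand in for whatever per-variant effect sizes a given PGS method would supply.

```python
import numpy as np


def polygenic_score(genotypes, weights):
    """Return one score per individual as a weighted count of risk alleles.

    genotypes: (n_individuals, n_variants) array of allele counts (0, 1 or 2).
    weights:   (n_variants,) array of per-variant weights (hypothetical here).
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return genotypes @ weights


def auc(scores, is_case):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as one half.
    """
    scores = np.asarray(scores, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    cases = scores[is_case]
    controls = scores[~is_case]
    # Compare every case with every control.
    diff = cases[:, None] - controls[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)


# Toy target sample: 4 individuals, 3 variants (made-up values).
genotypes = np.array([[2, 1, 0],
                      [0, 1, 1],
                      [1, 0, 2],
                      [0, 0, 0]])
weights = np.array([0.5, 0.2, -0.1])
is_case = np.array([True, False, True, False])

scores = polygenic_score(genotypes, weights)
print(scores, auc(scores, is_case))
```

In a leave-one-cohort-out design, the weights would be estimated from a GWAS that excludes the target cohort, and the two functions would then be applied to that left-out cohort only.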