Gene-environment association (GEA) studies are essential to understand the past and ongoing adaptations of organisms to their environment, but those studies are complicated by confounding due to unobserved demographic factors. Although the confounding problem has recently received considerable attention, the proposed approaches do not scale with the high dimensionality of genomic data. Here, we present a new estimation method for latent factor mixed models (LFMMs), implemented in an upgraded version of the corresponding computer program. We developed a least-squares approach to confounder estimation that provides a single framework for several categories of genomic data, not restricted to genotypes. The new algorithm is several orders of magnitude faster than existing GEA approaches and than our previous version of the LFMM program. In addition, the new method outperforms other fast approaches based on principal component or surrogate variable analysis. We illustrate the use of the program with analyses of the 1000 Genomes Project data set, leading to new findings on the adaptation of humans to their environment, and with analyses of DNA methylation profiles, providing insights into how tobacco consumption could affect DNA methylation in patients with rheumatoid arthritis. Software availability: Software is available in the R package lfmm at https://bcm-uga.github.io/lfmm/.
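The abstract above does not spell out the model, so the following is only a sketch of how a least-squares latent factor regression with a ridge penalty is commonly written. All symbols are assumptions introduced here for illustration: Y is the n x p matrix of genomic responses, X the n x d matrix of environmental variables, B the p x d matrix of effect sizes, U (n x K) and V (p x K) the latent factors and loadings, E the residual noise, and lambda a regularization parameter.

```latex
% Assumed notation, not taken from the abstract:
% Y (n x p), X (n x d), B (p x d), U (n x K), V (p x K), E (n x p).
\[
  Y = X B^{\top} + U V^{\top} + E
\]
% A ridge-penalized least-squares criterion, minimized jointly over
% the effect sizes B and the K latent factors (U, V):
\[
  \mathcal{L}_{\mathrm{ridge}}(U, V, B)
    = \bigl\lVert Y - U V^{\top} - X B^{\top} \bigr\rVert_F^{2}
    + \lambda \, \lVert B \rVert_2^{2},
  \qquad \lambda > 0 .
\]
```

Under this reading, the unobserved confounders are carried by the low-rank term U V^T, while the penalty on B keeps the effect size estimates identifiable in high dimension.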
Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to remove variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. In this study, we introduced least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. Computer simulations provided evidence that sparse latent factor regression models achieve higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator (LASSO) and a Bayesian sparse linear mixed model (BSLMM). Additional simulations based on real data showed that sparse latent factor regression models were more robust to departure from the generative model than non-sparse approaches, such as surrogate variable analysis (SVA) and other methods. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while avoiding multiple testing problems. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
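To make the role of a sparsity penalty concrete, here is a minimal Python sketch of soft-thresholding, the closed-form minimizer of 0.5 * (b - z)^2 + penalty * |b|. It is a generic illustration of how an L1-type penalty sets small effect sizes exactly to zero, not the authors' algorithm; the function names and the example values are hypothetical.

```python
from typing import List


def soft_threshold(value: float, penalty: float) -> float:
    """Return the minimizer of 0.5 * (b - value)**2 + penalty * abs(b)."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    # Estimates no larger than the penalty in magnitude become exactly zero.
    return 0.0


def sparse_effect_sizes(raw_estimates: List[float], penalty: float) -> List[float]:
    """Apply soft-thresholding to each (hypothetical) unpenalized effect size."""
    return [soft_threshold(b, penalty) for b in raw_estimates]


if __name__ == "__main__":
    # Hypothetical unpenalized estimates for four markers.
    raw = [0.05, -2.0, 0.3, 1.5]
    print(sparse_effect_sizes(raw, penalty=0.5))
```

Only markers whose raw estimate exceeds the penalty in magnitude keep a non-null effect size, which is how a sparse model can report non-null effects without a separate multiple-testing step.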
…ables, including batch effects, individual ancestry or tissue cell-type composition are integrated in the regression model by using latent factors. In these models, effect sizes and latent factors are estimated jointly. The latent factor regression framework encompasses several methods which include surrogate variable analysis (SVA, Leek and Storey (2007)), latent factor mixed models (LFMM, Frichot et al. (2013)), residual principal component analysis (Kalaitzis and Lawrence, 2012), and confounder adjusted testing and estimation (CATE, Wang et al. (2017)). Each method has specific merits relative to some category of association study, and the performances of the methods have been extensively debated in recent surveys (for example, see Kaushal et al. (2017)).
A property of many latent factor regression models is to use regularization parameters inducing constraints on effect size estimates. Among those methods, sparse regression models suppose that a relatively small proportion of all genomic variables correlate with the variable of interest or affect the phenotype, and ev...