Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to remove variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. In this study, we introduced least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. Computer simulations provided evidence that sparse latent factor regression models achieve higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator (LASSO) and a Bayesian sparse linear mixed model (BSLMM). Additional simulations based on real data showed that sparse latent factor regression models were more robust to departure from the generative model than non-sparse approaches, such as surrogate variable analysis (SVA) and other methods. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while avoiding multiple testing problems. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.
[...] variables, including batch effects, individual ancestry or tissue cell-type composition, are integrated in the regression model by using latent factors. In these models, effect sizes and latent factors are estimated jointly. The latent factor regression framework encompasses several methods which include surrogate variable analysis (SVA, Leek and Storey (2007)), latent factor mixed models (LFMM, Frichot et al. (2013)), residual principal component analysis (Kalaitzis and Lawrence, 2012), and confounder adjusted testing and estimation (CATE, Wang et al. (2017)). Each method has specific merits relative to some category of association study, and the performances of the methods have been extensively debated in recent surveys (for example, see Kaushal et al. (2017)).
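To make the idea of a latent factor regression model concrete, the following Python sketch simulates a response matrix of the form Y = x b' + U V' + E, where the exposure x is correlated with the hidden factors U, and compares naive marginal effect estimates with estimates adjusted for the factor loadings. It is an illustration only: the function names (`simulate`, `naive_effects`, `adjusted_effects`), the simulation settings, and the two-step adjustment (estimate loadings from residuals, then remove the loading directions from the naive estimates) are assumptions made for this example, not the sparse least-squares algorithms introduced in this study nor the procedure of any method cited above. The adjustment also assumes that the true effects are not aligned with the factor loadings.

```python
import numpy as np


def simulate(n=200, p=500, K=3, n_causal=10, seed=0):
    """Simulate Y = x b' + U V' + E with x correlated with the latent factors U.

    All settings here are hypothetical choices for illustration.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, K))                       # hidden confounders
    x = U @ np.full(K, 0.5) + rng.normal(size=n)      # exposure, confounded by U
    V = rng.normal(size=(p, K))                       # factor loadings
    b = np.zeros(p)
    b[:n_causal] = 1.0                                # sparse true effect sizes
    Y = np.outer(x, b) + U @ V.T + rng.normal(size=(n, p))
    return Y, x, b


def naive_effects(Y, x):
    """Marginal least-squares effect of x on each column of Y (no adjustment)."""
    return Y.T @ x / (x @ x)


def adjusted_effects(Y, x, K):
    """Two-step heuristic (an assumption of this sketch, not the paper's method).

    1. Estimate the K loading directions from the residuals of Y regressed on x.
    2. Remove those directions from the naive effect estimates.
    """
    b_naive = naive_effects(Y, x)
    R = Y - np.outer(x, b_naive)                      # residuals, orthogonal to x
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    T = Vt[:K].T                                      # p x K estimated loading basis
    return b_naive - T @ (T.T @ b_naive)


Y, x, b = simulate()
err_naive = np.mean((naive_effects(Y, x) - b) ** 2)
err_adj = np.mean((adjusted_effects(Y, x, K=3) - b) ** 2)
print(f"mean squared error, naive:    {err_naive:.4f}")
print(f"mean squared error, adjusted: {err_adj:.4f}")
```

In this simulated setting the naive estimates absorb part of the confounder signal, so their error is expected to be larger than that of the adjusted estimates. The sketch uses no sparsity penalty, and the number of factors K is taken as known.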
A property of many latent factor regression models is to use regularization parameters inducing constraints on effect size estimates. Among those methods, sparse regression models suppose that a relatively small proportion of all genomic variables correlate with the variable of interest or affect the phenotype, and ev...