Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics

Lin, Wei; Feng, Rui; Li, Hongzhe

doi:10.1080/01621459.2014.908125

Cited by 72 publications

(100 citation statements)

References 58 publications


“…At first, we were confused with the above result because Lin et al. () reported that the predicted transcriptome obtained with genetic variants as IV significantly improved phenotype predictions. Soon afterwards, we noticed that only parts of transcripts were predictable with relatively high predictability (Figure a), and we then proposed to use the predicted values of ‘genetically predictable genes (GPGs)’ of the first layer, denoted by PT.1L.GPGs, to predict phenotypes.…”

Section: Results (mentioning)

confidence: 97%

“…Lin et al. () used a two‐stage least squares (2SLS) method to choose an optimal sparse subset of β₀ and Γ₀. Through strict mathematical derivation and a large scale of simulation test, they claimed that the 2SR method was reliable and powerful for genomic prediction.…”

Section: Methods (mentioning)

confidence: 99%

“…Here, we proposed such an integration strategy, that is multilayered least absolute shrinkage and selection operator (MLLASSO), for improving GP. The key idea of MLLASSO is to implement an innovative directed learning strategy that allows us to learn three layers of genetic features (we denote them as 'GFs.1L', 'GFs.2L' and 'GFs.3L' in the rest of this study) supervised by transcriptomic and metabolomic data using genetic variants as instrumental variables (IV), which has been proven to be an efficient statistical technique to select and estimate optimal instruments (Lin et al, 2015). Our approach is still GP because it only requires genomic markers as the input data, but it differs from the traditional GP in that it integrates transcriptomic and metabolic information into a single model and may capture higher order information of gene interactions.…”

Section: Introduction (mentioning)

confidence: 99%


A directed learning strategy integrating multiple omic data improves genomic prediction

Xie et al. 2019

Plant Biotechnology Journal


Summary: Genomic prediction (GP) aims to construct a statistical model for predicting phenotypes using genome‐wide markers and is a promising strategy for accelerating molecular plant breeding. However, current progress of phenotype prediction using genomic data alone has reached a bottleneck, and previous studies on transcriptomic and metabolomic predictions ignored genomic information. Here, we designed a novel strategy of GP called multilayered least absolute shrinkage and selection operator (MLLASSO) by integrating multiple omic data into a single model that iteratively learns three layers of genetic features (GFs) supervised by observed transcriptome and metabolome. Significantly, MLLASSO learns higher order information of gene interactions, which enables us to achieve a significant improvement of predictability of yield in rice from 0.1588 (GP alone) to 0.2451 (MLLASSO). In the prediction of the first two layers, some genes were found to be genetically predictable genes (GPGs) as their expressions were accurately predicted with genetic markers. Interestingly, we made three dramatic discoveries for the GPGs: (i) GPGs are good predictors for highly complex traits like yield; (ii) GPGs are mostly eQTL genes (cis or trans); and (iii) trait‐related transcriptional factor families are enriched in GPGs. These findings support the notion that learned GFs not only are good predictors for traits but also have specific biological implications regarding regulation of gene expressions. To differentiate the new method from conventional GP models, we called MLLASSO a directed learning strategy supervised by intermediate omic data. This new prediction model appears to be more reliable and more robust than conventional GP models.



“…In contrast to the ordinary linear model regressing y on X, model (2) does not require that the covariate X and the error η be independent, thus substantially relaxing the assumptions of ordinary regression models and being more appealing in data analysis. Wei et al. [15] developed two-stage penalized estimation procedure to estimate the parameters and to simultaneously identify the possible instruments and genes that are associated with the phenotype y.…”

Section: Integrative Analysis Of Genetic Variants Molecular Phenotyp (mentioning)

confidence: 99%

Systems biology approaches to epidemiological studies of complex diseases

2013

WIREs Mechanisms of Disease

Self Cite


Systems biology approaches to epidemiological studies of complex diseases include collection of genetic, genomic, epigenomic and metagenomic data in large-scale epidemiological studies of complex phenotypes. Designs and analyses of such studies raise many statistical challenges. This paper reviews some issues related to integrative analysis of such high dimensional and inter-related data sets and outlines some possible solutions. I focus my review on integrative approaches for genome-wide genetic variants and gene expression data, methods for joint analysis of genetic and epigenetic variants and methods for analysis of microbiome data. Statistical methods such as mediation analysis, high dimensional instrumental variable regression, sparse signal recovery and compositional data regression provide potential frameworks for integrative analysis of these high dimensional genomic data.
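
Illustrative sketch (not taken from any of the cited papers): the citation statement above notes that instrumental variables regression does not require the covariate X and the error to be independent. The Python example below shows this on simulated data with one instrument, one covariate and an unobserved confounder. It uses classical two-stage least squares, not the penalized high-dimensional procedure of Lin et al.; the function names, the simulated data-generating process and all parameter values are assumptions chosen for illustration only.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Least-squares slope of y on x (model with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def two_stage_least_squares(z, x, y):
    """Classical 2SLS with a single instrument z for a single covariate x."""
    # Stage 1: regress x on z and keep the fitted values.
    gamma = slope(z, x)
    mz, mx = mean(z), mean(x)
    x_hat = [mx + gamma * (zi - mz) for zi in z]
    # Stage 2: regress y on the fitted values from stage 1.
    return slope(x_hat, y)


def simulate(n=5000, beta=0.5, seed=0):
    """Hypothetical data: u confounds x and y; z affects y only through x."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)            # instrument (e.g. a genetic variant)
        ui = rng.gauss(0, 1)            # unobserved confounder
        xi = zi + ui + 0.5 * rng.gauss(0, 1)
        yi = beta * xi + 2.0 * ui + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()
beta_ols = slope(x, y)                          # biased by the confounder
beta_2sls = two_stage_least_squares(z, x, y)    # close to the true 0.5
print("OLS estimate: ", round(beta_ols, 3))
print("2SLS estimate:", round(beta_2sls, 3))
```

In this simulated setting the ordinary regression slope is pulled away from the true coefficient because x and the error share the confounder, while the two-stage estimate stays near it. The high-dimensional methods discussed in the cited work add sparsity-inducing penalties to both stages, which this sketch omits.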


“…With the development of modern technology for data collection, high-dimensional data have become increasingly common in many scientific research fields, e.g., genome-wide studies (Lin et al 2015), biomedical sciences (Mukherjee et al. 2015), economics and finance (Basu and Michailidis 2015).…”

Section: Introduction (mentioning)

confidence: 99%

Regularized estimation in sparse high-dimensional multivariate regression, with application to a DNA methylation study

Zhang, Zheng, Yoon et al. 2017

Statistical Applications in Genetics and Molecular Biology


Summary: In this article, we consider variable selection for correlated high dimensional DNA methylation markers as multivariate outcomes. A novel weighted square-root LASSO procedure is proposed to estimate the regression coefficient matrix. A key feature of this method is tuning-insensitivity, which greatly simplifies the computation by obviating cross validation for penalty parameter selection. A working precision matrix obtained via the constrained ℓ1 minimization method (Cai et al. 2011) is used to account for the within-subject correlation among multivariate outcomes. Oracle inequalities of the regularized estimators are derived. The performance of our proposed method is illustrated via extensive simulation studies. We apply our method to study the relation between smoking and high dimensional DNA methylation markers in the Normative Aging Study (NAS).
