Probabilistic partial least squares model: Identifiability, estimation and application

Bouhaddani, Said el; Uh, Hae‐Won; Hayward, Caroline; Jongbloed, Geurt; Houwing‐Duistermaat, Jeanine J.

doi:10.1016/j.jmva.2018.05.009

Cited by 13 publications

(53 citation statements)

References 26 publications


“…Extending latent variable methods to probabilistic models is not new. PCA was extended to Probabilistic PCA in [4], and PPLS [10] was proposed to provide a probabilistic framework for PLS. It has been shown that the probabilistic counterpart has a lower bias in estimation and is robust to non-normally distributed variables [10].…”

Section: Discussion (mentioning)

confidence: 99%

“…PCA was extended to Probabilistic PCA in [4], and PPLS [10] was proposed to provide a probabilistic framework for PLS. It has been shown that the probabilistic counterpart has a lower bias in estimation and is robust to non-normally distributed variables [10]. More importantly, the probabilistic model will allow statistical inference, making it possible to interpret the relevance and importance of features in the population, and facilitating follow-up studies.…”

Section: Discussion (mentioning)

confidence: 99%


Statistical Integration of Two Omics Datasets Using GO2PLS

Bouhaddani, Pei, et al. 2020 (preprint, self-citation)

Background: Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various methods have been proposed, such as Partial Least Squares (PLS), decomposing two datasets into joint and residual subspaces. Omics data are heterogeneous, and the joint subspaces estimated in PLS contain orthogonal variations unrelated to one another. Alternatively, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing the orthogonal subspaces and better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, while the interests of domain experts might be in a small subset. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS) that performs feature selection. Furthermore, features in the data often have group structures, and incorporating this information might improve the reliability of the selection procedure.

Results: The simulation study showed that introducing sparsity improved the performance concerning feature selection. Furthermore, incorporating group structures increased the precision and power of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. For illustration, we applied GO2PLS to datasets from two studies: CVON-DOSIS (small case-control study) and TwinsUK (population study). In the first, by integrating regulomics and transcriptomics data, joint components using GO2PLS discriminated cardiomyopathy patients and controls better than PCA and PLS. Genes selected based on the regulatory regions and transcripts appeared to be relevant to heart muscle disease. In the second, we incorporated the information on the group structures of the methylation CpG sites when integrating the methylation dataset with the glycomics data. The selected methylation groups turned out to be relevant to the immune system, in which glycans play an important role.

Conclusions: GO2PLS is an efficient approach for integrating two heterogeneous omics datasets to gain insights. It performs feature selection in both datasets, enhanced by incorporating known group structures, thereby resulting in a small subset of features for further investigation.


“…Such analyses will provide information about the genes highly represented by the two datasets. Another extension was developed by Bouhaddani, Uh, Hayward, Jongbloed, and Houwing‐Duistermaat (2018); they embedded PLS in a probabilistic framework to facilitate statistical inference and unique identification of the parameters. One of the future directions is to compare the performance of the measurement error models and PLS methods, where the former uses additional information about the structure of the data, and the latter estimates this structure from the datasets.…”

Section: Conclusion and Discussion (mentioning)

confidence: 99%

Statistical method for modeling sequencing data from different technologies in longitudinal studies with application to Huntington disease

et al. 2020


Advancement of gene expression measurements in longitudinal studies enables the identification of genes associated with disease severity over time. However, problems arise when the technology used to measure gene expression differs between time points. Observed differences between the results obtained at different time points can be caused by technical differences. Modeling the two measurements jointly over time might provide insight into the causes of these different results. Our work is motivated by a study of gene expression data of blood samples from Huntington disease patients, which were obtained using two different sequencing technologies. At time point 1, DeepSAGE technology was used to measure the gene expression, with a subsample also measured using RNA‐Seq technology. At time point 2, all samples were measured using RNA‐Seq technology. Significant associations between gene expression measured by DeepSAGE and disease severity using data from the first time point could not be replicated by the RNA‐Seq data from the second time point. We modeled the relationship between the two sequencing technologies using the data from the overlapping samples. We used linear mixed models with either DeepSAGE or RNA‐Seq measurements as the dependent variable and disease severity as the independent variable. In conclusion, (1) for one out of 14 genes, the initial significant result could be replicated with both technologies using data from both time points; (2) statistical efficiency is lost due to disagreement between the two technologies, measurement error when predicting gene expressions, and the need to include additional parameters to account for possible differences.


“…An example is envelope regression (Cook and Zhang 2015), which fully models the covariance structure and is therefore not suited for high dimensional data. Alternatively, probabilistic PLS (PPLS) (el Bouhaddani et al 2018a) uses a simpler covariance structure with less parameters and is applicable to high dimensional datasets. In contrast to PPLS and envelope regression, SIFA (Li and Jung 2017) models specific components.…”

Section: Introduction (mentioning)

confidence: 99%

Statistical Integration of Heterogeneous Data with PO2PLS

Bouhaddani, Uh, Jongbloed, et al. 2021 (preprint, self-citation)

The availability of multi-omics data has revolutionized the life sciences by creating avenues for integrated system-level approaches. Data integration links the information across datasets to better understand the underlying biological processes. However, high-dimensionality, correlations and heterogeneity pose statistical and computational challenges. We propose a general framework, probabilistic two-way partial least squares (PO2PLS), which addresses these challenges. PO2PLS models the relationship between two datasets using joint and data-specific latent variables. For maximum likelihood estimation of the parameters, we implement a fast EM algorithm and show that the estimator is asymptotically normally distributed. A global test for testing the relationship between two datasets is proposed, and its asymptotic distribution is derived. Notably, several existing omics integration methods are special cases of PO2PLS. Via extensive simulations, we show that PO2PLS performs better than alternatives in feature selection and prediction performance. In addition, the asymptotic distribution appears to hold when the sample size is sufficiently large. We illustrate PO2PLS with two examples from commonly used study designs: a large population cohort and a small case-control study. Besides recovering known relationships, PO2PLS also identified novel findings. The methods are implemented in our R-package PO2PLS. Supplementary materials for this article are available online.
