Background
Nowadays, multiple omics data are measured on the same samples in the belief that these different omics datasets represent various aspects of the underlying biological systems. Integrating these omics datasets will facilitate the understanding of the systems. For this purpose, various methods have been proposed, such as Partial Least Squares (PLS), which decomposes two datasets into joint and residual subspaces. Omics data are heterogeneous, and the joint subspaces estimated in PLS contain orthogonal variations unrelated to one another. Alternatively, Two-way Orthogonal Partial Least Squares (O2PLS) captures the heterogeneity by introducing orthogonal subspaces and better estimates the joint subspaces. However, the latent components spanning the joint subspaces in O2PLS are linear combinations of all variables, while the interests of domain experts might be in a small subset. To obtain sparsity, we extend O2PLS to Group Sparse O2PLS (GO2PLS), which performs feature selection. Furthermore, features in the data often have group structures, and incorporating this information might improve the reliability of the selection procedure.

Results
The simulation study showed that introducing sparsity improved the performance concerning feature selection. Furthermore, incorporating group structures increased the precision and power of the feature selection procedure. GO2PLS performed optimally in terms of accuracy of joint score estimation, joint loading estimation, and feature selection. For illustration, we applied GO2PLS to datasets from two studies: CVON-DOSIS (a small case-control study) and TwinsUK (a population study). In the first, by integrating regulomics and transcriptomics data, joint components estimated using GO2PLS discriminated cardiomyopathy patients and controls better than those from PCA and PLS. Genes selected based on the regulatory regions and transcripts appeared to be relevant to heart muscle disease.
In the second, we incorporated the information on the group structures of the methylation CpG sites when integrating the methylation dataset with the glycomics data. The selected methylation groups turned out to be relevant to the immune system, in which glycans play an important role.

Conclusions
GO2PLS is an efficient approach for integrating two heterogeneous omics datasets to gain insights. It performs feature selection in both datasets, enhanced by incorporating known group structures, thereby resulting in a small subset of features for further investigation.
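
The idea of selecting features group by group can be illustrated with a group-wise soft-thresholding step applied to a loading vector. The sketch below is a generic illustration and not necessarily the procedure used in GO2PLS: the function name `group_soft_threshold`, the penalty parameter `lam`, and the example values are all hypothetical. It shrinks each group of loadings according to the group's Euclidean norm, so that weak groups are set to zero as a whole, and then rescales the result to unit length.

```python
import math

def group_soft_threshold(loadings, groups, lam):
    """Shrink loadings group by group (hypothetical helper, for illustration).

    loadings: list of floats, one per feature
    groups:   list of group labels, same length as loadings
    lam:      non-negative penalty; larger values zero out more groups
    """
    # Accumulate the squared Euclidean norm of each group.
    sq_norm = {}
    for w, g in zip(loadings, groups):
        sq_norm[g] = sq_norm.get(g, 0.0) + w * w

    # Per-group scaling factor: max(0, 1 - lam / ||w_g||).
    scale = {}
    for g, s in sq_norm.items():
        norm = math.sqrt(s)
        scale[g] = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0

    shrunk = [w * scale[g] for w, g in zip(loadings, groups)]

    # Rescale to unit length; leave an all-zero vector unchanged.
    total = math.sqrt(sum(w * w for w in shrunk))
    return [w / total for w in shrunk] if total > 0 else shrunk


# Toy example: group "b" has a small norm and is removed entirely.
weights = [0.6, 0.8, 0.05, -0.05, 0.3]
labels = ["a", "a", "b", "b", "c"]
sparse = group_soft_threshold(weights, labels, lam=0.2)
selected = sorted({g for w, g in zip(sparse, labels) if w != 0.0})
```

In this toy example, `selected` contains the groups whose loadings survive the threshold; all features in a discarded group are zero together, which is what distinguishes group-wise selection from selecting features one at a time.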