Integrating publicly available and newly generated patient-derived transcriptomic datasets is not straightforward and requires specialized approaches to handle heterogeneity at both the technical and the biological level. Here we present a methodology that uses previously collected large reference datasets to overcome technical biases, predict clinically relevant outcomes and identify tumour-related biological processes in patients. The approach is based on independent component analysis (ICA), an unsupervised method of signal deconvolution. We developed parallel consensus ICA, which robustly decomposes merged new and reference datasets into signals with minimal mutual dependency. By applying the method to a small cohort of primary melanoma and control samples combined with a large public melanoma dataset, we demonstrate that it distinguishes cell-type-specific signals from technical biases and allows prediction of clinically relevant patient characteristics. Cancer subtypes, patient survival and the activity of key tumour-related processes such as immune response, angiogenesis and cell proliferation were characterized. Additionally, by integrating transcriptomes and miRNomes, the method identified biological functions of miRNAs that would otherwise not have been accessible.
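To make the idea of consensus ICA concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes an expression matrix `X` with genes in rows and samples in columns, a symmetric FastICA with a tanh nonlinearity, and a consensus built by repeating the decomposition on random subsets of samples, matching components to the first run by absolute correlation, and averaging them. The function names, the 80% subsampling fraction, the number of runs and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def sym_decorrelate(W):
    # Symmetric decorrelation: W <- (W W^T)^(-1/2) W
    vals, vecs = np.linalg.eigh(W @ W.T)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T @ W

def fast_ica(X, k, rng, n_iter=200, tol=1e-6):
    # X: (n_genes, n_samples); genes are treated as observations.
    Xc = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Z = U[:, :k] * np.sqrt(X.shape[0])  # PCA-whitened data, (n_genes, k)
    W = sym_decorrelate(rng.normal(size=(k, k)))
    for _ in range(n_iter):
        G = np.tanh(Z @ W.T)
        W_new = (G.T @ Z) / Z.shape[0] - np.diag((1.0 - G**2).mean(axis=0)) @ W
        W_new = sym_decorrelate(W_new)
        change = np.max(np.abs(np.abs(np.einsum("ij,ij->i", W_new, W)) - 1.0))
        W = W_new
        if change < tol:
            break
    return Z @ W.T  # estimated independent signals, (n_genes, k)

def match_to_reference(S_ref, S_run):
    # Greedily pair components by absolute correlation and fix their signs.
    k = S_ref.shape[1]
    C = np.corrcoef(S_ref.T, S_run.T)[:k, k:]
    A = np.abs(C)
    aligned = np.empty_like(S_ref)
    for _ in range(k):
        i, j = np.unravel_index(np.argmax(A), A.shape)
        aligned[:, i] = np.sign(C[i, j]) * S_run[:, j]
        A[i, :] = -1.0
        A[:, j] = -1.0
    return aligned

def consensus_ica(X, k, n_runs=10, keep=0.8, seed=0):
    # Runs are independent and could be distributed over workers;
    # they are executed sequentially here for simplicity.
    rng = np.random.default_rng(seed)
    n_samples = X.shape[1]
    size = max(k, int(keep * n_samples))
    runs = []
    for _ in range(n_runs):
        cols = rng.choice(n_samples, size=size, replace=False)
        runs.append(fast_ica(X[:, cols], k, rng))
    S_ref = runs[0]
    aligned = [S_ref] + [match_to_reference(S_ref, S) for S in runs[1:]]
    S = np.mean(aligned, axis=0)  # consensus signals, (n_genes, k)
    Xc = X - X.mean(axis=0)
    M, *_ = np.linalg.lstsq(S, Xc, rcond=None)  # weights, (k, n_samples)
    return S, M

# Synthetic example (assumed data): 3 non-Gaussian sources, 20 samples.
rng = np.random.default_rng(1)
S_true = rng.laplace(size=(2000, 3))
M_true = rng.normal(size=(3, 20))
X = S_true @ M_true + 0.05 * rng.normal(size=(2000, 20))
S_hat, M_hat = consensus_ica(X, k=3)
```

In this sketch the columns of `S_hat` play the role of gene-level signals and the rows of `M_hat` give the weight of each signal in each sample. Weights of that kind are what could then be related to clinical variables.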
RNA-seq and miRNA-seq of investigation set: GDC data portal: https://portal.gdc.cancer.gov/
Expression data of validation set: ArrayExpress: https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-19234/
New expression data of investigation set: The sequencing data for 3 primary melanoma tumours and 2 controls are freely available under the GEO accession number GSE116111. Data for miRNAs are in the Supplementary Table S5.
Tools:
Consensus parallel ICA: https://gitlab.com/biomodlih/consica
R/Bioconductor v.3.4.3 with packages fastICA, doMC, doSNOW, topGO, randomForest, survival: https://cran.r-project.org/
Enrichr
ACKNOWLEDGEMENT
We would like to thank the patients for providing clinical material and the clinical staff of the Dermatology Unit at the University of Freiburg for professionally taking and handling all the samples. Bioinformatics analyses presented in this paper were partly carried out using the high-performance computing facilities of the University of Luxembourg (http://hpc.uni.lu). Furthermore, we thank Demetra Philippidou and Dr. Susanne Reinsbach for processing clinical samples and NHEM cells forming the investigation dataset. We acknowledge Dr. Aurélien Ginolhac and Cristina Maximo for critically reading the manuscript and for valuable input.