Multi-omics integrative analysis can capture associations among different omics layers and thus provides a comprehensive view of the complex mechanisms underlying cancer. However, a portion of samples often lacks one type of omics data because of experimental limitations, which is an obstacle for downstream analyses that require complete datasets. Current imputation methods mainly focus on single-cancer datasets, which limits their ability to exploit information from large pan-cancer datasets. We present TDimpute, a novel transfer learning-based deep neural network that imputes missing gene expression data from DNA methylation data. A general model was first trained on the pan-cancer dataset and then fine-tuned on each cancer-specific dataset. We compared our method with other state-of-the-art methods on 16 cancer datasets and found that it consistently outperforms them in terms of imputation error, recovery of methylation-expression correlations, and downstream analyses, including the identification of DNA methylation-driving genes and prognosis-related genes, clustering analysis, and survival analysis. The improvements are especially pronounced at high missing rates.
Author summary
As an epigenetic modification, DNA methylation plays an important role in regulating gene expression. However, because of limited sample availability and cost, gene expression is not measured for some samples, which reduces the sample size available for integrative analysis of DNA methylation and gene expression. The accuracy of traditional imputation methods is limited because they cannot effectively use the information in DNA methylation data and other relevant datasets. Because deep neural networks can model nonlinear relationships, we used one to impute missing gene expression data by learning a nonlinear transformation from DNA methylation data to gene expression data. We also employed transfer learning to alleviate the shortage of training data for the deep learning model. On 16 cancer datasets from The Cancer Genome Atlas (TCGA), our method yields higher accuracy than other methods. More importantly, downstream analyses perform better on the imputed gene expression datasets, which indicates that the values imputed by our method are more biologically meaningful.
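The two-stage idea described here (train a general methylation-to-expression model on pooled pan-cancer data, then fine-tune it on one cancer type) can be illustrated with a minimal sketch. This is not the TDimpute implementation: the data are synthetic, and the layer sizes, tanh activation, learning rate, epoch counts, and the helper names `init_params`, `forward`, `train`, and `simulate` are all assumptions chosen only to keep the example small and self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(n_in, n_hidden, n_out):
    # Small random weights for a one-hidden-layer network (sizes are illustrative).
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(p, X):
    # Nonlinear map from methylation features to predicted expression.
    H = np.tanh(X @ p["W1"] + p["b1"])
    return H @ p["W2"] + p["b2"], H

def train(p, X, Y, lr, epochs):
    # Full-batch gradient descent on mean squared error.
    # Returns a trained copy of the parameters and the loss at each epoch.
    p = {k: v.copy() for k, v in p.items()}
    losses = []
    for _ in range(epochs):
        pred, H = forward(p, X)
        err = pred - Y
        losses.append(float(np.mean(err ** 2)))
        d_pred = 2.0 * err / err.size
        d_hidden = (d_pred @ p["W2"].T) * (1.0 - H ** 2)
        grads = {
            "W2": H.T @ d_pred,
            "b2": d_pred.sum(axis=0),
            "W1": X.T @ d_hidden,
            "b1": d_hidden.sum(axis=0),
        }
        for k in p:
            p[k] -= lr * grads[k]
    return p, losses

# Synthetic stand-ins: 20 "CpG" features and 5 "genes" (hypothetical sizes).
n_cpg, n_genes = 20, 5
W_shared = rng.normal(0.0, 0.5, (n_cpg, n_genes))  # relationship shared across cancers
W_shift = rng.normal(0.0, 0.2, (n_cpg, n_genes))   # cancer-specific deviation

def simulate(n, W):
    X = rng.uniform(0.0, 1.0, (n, n_cpg))  # beta-value-like inputs in [0, 1]
    Y = np.tanh((X - 0.5) @ W) + rng.normal(0.0, 0.05, (n, n_genes))
    return X, Y

X_pan, Y_pan = simulate(500, W_shared)            # large pooled "pan-cancer" set
X_tgt, Y_tgt = simulate(40, W_shared + W_shift)   # small target-cancer set
X_missing, _ = simulate(10, W_shared + W_shift)   # samples with methylation only

# Stage 1: train a general model on the pooled data.
general, pre_losses = train(init_params(n_cpg, 16, n_genes), X_pan, Y_pan,
                            lr=0.05, epochs=300)

# Stage 2: fine-tune the general model on the target cancer.
specific, ft_losses = train(general, X_tgt, Y_tgt, lr=0.05, epochs=200)

# Impute expression for samples that have methylation data only.
imputed, _ = forward(specific, X_missing)
print("pretraining loss:", pre_losses[0], "->", pre_losses[-1])
print("fine-tuning loss:", ft_losses[0], "->", ft_losses[-1])
print("imputed matrix shape:", imputed.shape)
```

Starting the second stage from the pretrained weights rather than from a fresh initialization is what makes this a transfer learning setup: the small target-cancer dataset only has to adjust a model that already encodes the shared methylation-expression relationship.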