Understanding the dynamic interactions between malignant cells and the tumor stroma is a major goal of cancer research. Here we developed a Bayesian model that jointly infers both cellular composition and gene expression in each cell type, including heterogeneous malignant cells, from bulk RNA-seq using scRNA-seq as prior information. We conducted an integrative analysis of 85 single-cell and 1,412 bulk RNA-seq datasets in primary human glioblastoma, head and neck squamous cell carcinoma, and melanoma. We identified cell types correlated with clinical outcomes and explored regional heterogeneity in tumor state and stromal composition. We redefined common molecular subtypes using gene expression in malignant cells, after excluding confounding non-malignant cell types. Finally, we identified genes whose expression in malignant cells correlated with infiltration of macrophages, T-cells, fibroblasts, and endothelial cells across multiple tumor types. Our work provides a new lens that we used to measure cellular composition and expression in a statistically powered cohort of three primary human malignancies.
Results
Bayesian inference of cell type composition and tumor expression
TED uses a scRNA-seq reference dataset to infer two parameters of interest from bulk RNA-seq data: (i) the proportion of each cell type in the bulk population and (ii) the average expression profile of each cell type. TED treats the proportion and expression profile of each cell type as latent variables that it infers from the data (Fig. 1a, Supplementary Fig. 1, Supplementary Note 1). TED makes the key simplifying assumption that each non-malignant cell type shares a common gene expression profile across patients, as observed in the cases analyzed to date 28,30,31. Critically, each bulk RNA-seq sample is then assumed to have a unique tumor expression profile that we infer from the data.
Expression in the reference and bulk RNA-seq data often differs substantially due to batch effects or tumor heterogeneity. To account for uncertainty in the reference cell type expression matrix, TED implements a fully Bayesian inference of tumor composition. First, TED uses Gibbs sampling to estimate the joint posterior distribution of cell type composition, θ0, and gene expression profiles, Z, i.e. P(θ0, Z | φ, X; α) (Fig. 1a, red, top). Second, to account for tumor cells, or other cell types that are not observed in the reference dataset, TED infers a maximum likelihood estimate (MLE) of the tumor expression profiles, ψtum (Fig. 1a, red, middle). During this step TED also infers the maximum a posteriori (MAP) estimate of the expression profile of non-malignant stromal cells, ψstr, to correct for batch effects between the bulk and single-cell RNA-seq platforms. Last, TED uses the updated expression profile for each patient and cell type to resample the posterior distribution of cell type composition, θ, i.e. P(θ | ψtum, ψstr, X; α) (Fig. 1a, red, bottom). Optionally, TED has an additional mode which can be used to learn common patterns of expression heterogeneit...
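To make the first Gibbs sampling step more concrete, the following is a minimal sketch for a single bulk sample, not the TED implementation itself. It assumes a multinomial likelihood for reads with a symmetric Dirichlet prior on the composition, which the text above does not spell out. The function name `gibbs_deconvolve`, its arguments, and the default value of `alpha` are hypothetical. Here `phi` stands for the reference profiles φ (one row per cell type, normalized to sum to 1), `x` for one bulk sample from X, and `theta` for the composition θ0, expressed as the fraction of reads contributed by each cell type.

```python
import numpy as np


def gibbs_deconvolve(x, phi, alpha=1.0, n_iter=300, burn_in=100, seed=0):
    """Hypothetical sketch: sample P(theta0, Z | phi, x; alpha) for one bulk sample.

    x     : (G,) integer read counts of the bulk sample
    phi   : (K, G) reference profiles, strictly positive, each row summing to 1
    alpha : symmetric Dirichlet hyperparameter (assumed form of the prior)
    Returns the posterior means of theta (K,) and Z (K, G).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=np.int64)
    phi = np.asarray(phi, dtype=float)
    n_types, n_genes = phi.shape

    theta = np.full(n_types, 1.0 / n_types)  # start from a uniform composition
    theta_sum = np.zeros(n_types)
    z_sum = np.zeros((n_types, n_genes))

    for it in range(n_iter):
        # Probability that a read of gene g came from cell type k.
        weights = theta[:, None] * phi
        probs = weights / weights.sum(axis=0, keepdims=True)

        # Sample Z | theta: split the reads of each gene among cell types.
        z = np.empty((n_types, n_genes), dtype=np.int64)
        for g in range(n_genes):
            z[:, g] = rng.multinomial(x[g], probs[:, g])

        # Sample theta | Z: Dirichlet update from the reads per cell type.
        theta = rng.dirichlet(alpha + z.sum(axis=1))

        if it >= burn_in:
            theta_sum += theta
            z_sum += z

    n_kept = n_iter - burn_in
    return theta_sum / n_kept, z_sum / n_kept
```

Under these assumptions, the two conditional draws alternate between allocating reads to cell types (Z) and updating the composition (θ0). The later steps described above, the MLE of ψtum, the MAP estimate of ψstr, and the resampling of θ, are not covered by this sketch.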