Single-cell transcriptomics offers a tool to study the diversity of cell phenotypes through snapshots of the abundance of mRNA in individual cells. Often there is additional information available besides the single-cell gene expression counts, such as bulk transcriptome data from the same tissue, or quantification of surface protein levels from the same cells. In this study, we propose models based on the Bayesian generative approach, where protein quantification available as CITE-seq counts from the same cells is used to constrain the learning process, thus forming a semi-supervised model. The generative model is based on the deep variational autoencoder (VAE) neural network architecture.

Keywords semi-supervised · single-cell · RNA sequencing · deep learning · Bayesian inference

1 Introduction

Single-cell RNA sequencing (scRNA-seq) [1,2,3] is a powerful tool to analyze cell states at high resolution based on their gene expression profiles. RNA sequencing at the single-cell level facilitates uncovering heterogeneous gene expression patterns in seemingly homogeneous cell populations. However, the current methods for gene expression profiling at single-cell resolution are prone to experimental errors, in particular, inefficient capture of mRNAs [2]. This capture inefficiency results in a general underestimation of the counts (dropout effect). This is a problem because the current computational approaches for analyzing single-cell data rely on the mRNA counts for clustering and downstream analysis.

Generally, the solution to the dropout problem has been posed as an imputation task, where missing counts are filled with estimated counts. Different methods have been proposed for this task, such as non-negative regression [4] or graph-based methods [5].
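To make the dropout effect concrete, the following is a minimal simulation sketch and not the model proposed in this work. It assumes that true counts are overdispersed (drawn here as a gamma-Poisson mixture) and that each mRNA molecule is captured independently with a fixed probability. The function names, the capture efficiency of 0.1, and the mean and dispersion values are all illustrative assumptions that do not come from the text.

```python
import math
import random


def sample_poisson(rng, rate):
    """Draw a Poisson count with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_overdispersed(rng, mean, dispersion):
    """Draw a count from a gamma-Poisson mixture (negative binomial).

    The gamma rate has shape `dispersion` and scale `mean / dispersion`,
    so the resulting count has expectation `mean`.
    """
    rate = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(rng, rate)


def capture(rng, true_count, efficiency):
    """Keep each molecule independently with probability `efficiency`."""
    return sum(1 for _ in range(true_count) if rng.random() < efficiency)


def zero_fraction(counts):
    """Fraction of entries that are exactly zero."""
    return sum(1 for c in counts if c == 0) / len(counts)


def simulate(n_cells=2000, mean=5.0, dispersion=2.0, efficiency=0.1, seed=0):
    """Simulate true and observed counts of one gene across cells.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    true_counts = [sample_overdispersed(rng, mean, dispersion)
                   for _ in range(n_cells)]
    observed = [capture(rng, c, efficiency) for c in true_counts]
    return true_counts, observed


true_counts, observed = simulate()
print("mean true / observed:",
      sum(true_counts) / len(true_counts), sum(observed) / len(observed))
print("zero fraction true / observed:",
      zero_fraction(true_counts), zero_fraction(observed))
```

Under these assumptions the observed counts can never exceed the true counts, so the observed mean is lower and the share of zeros is higher. This is the underestimation that imputation methods try to correct.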
Another option is to model the dropout effect using the zero-inflated (ZI) model [6], where a two-component mixture distribution is constructed, such that the first component models the dropout effect and the second component the observed counts. Because overdispersion is strongly present in scRNA-seq counts, the negative binomial (NB) distribution is seen as an appropriate fit to the observed data [7]. Shallow imputation models that are based on zero-inflated negative binomial (ZINB) or zero-inflated log-normal models have been applied to single-cell data [8,9]. However, these models hypothesize a linear relation between the latent space and the model parameters, which is quite a strong assumption [10]. To overcome the limitations of the linear models, deep neural network architectures have been proposed to resolve missing data (dropouts) [11]. However, discerning technical variation from biological signal solely based on scRNA-seq data is challenging, and assumes that a large number of similar cells are measured.

A PREPRINT - MAY 8, 2019

Accurate imputation strategies are important for downstream analysis, including identification of cell type marker genes, characterization of functional state [12], or the analysis of transcriptome dynamics along differentia...