Identifying gene expression programs underlying cell-type identity and cellular processes is a crucial step toward understanding the organization of cells and tissues. Although single-cell RNA-Seq (scRNA-Seq) can quantify transcripts in individual cells, each cell's expression may derive both from programs determining cell-type and from programs facilitating dynamic cellular activities such as cell-division or apoptosis, which cannot be easily disentangled with current methods. Here, we introduce clustered nonnegative matrix factorization (cNMF) as a solution to this problem. We show with simulations that it deconvolutes scRNA-Seq profiles into interpretable programs corresponding to both cell-types and cellular activities. Applied to published brain organoid and visual cortex datasets, cNMF refines the hierarchy of cell-types and identifies both expected (e.g. cell-cycle and hypoxia) and intriguing novel activity programs. In summary, we show that cNMF can increase the accuracy of cell-type identification while simultaneously inferring interpretable cellular activity programs in scRNA-Seq data, thus providing useful insight into how cells vary dynamically within cell-types.

bioRxiv preprint first posted online Apr. 30, 2018; doi: http://dx.doi.org/10.1101/310599. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Main Text

Genes act in concert to maintain a cell's identity as a specific cell-type, to respond to external signals, and to carry out complex cellular activities such as replication and differentiation. Coordinating the necessary genes for these functions is frequently achieved through transcriptional co-regulation, whereby genes are induced together as a gene expression program (GEP) in response to the appropriate internal or external signal 1,2. Transcriptome-wide expression profiling technologies such as RNA-Seq have made it possible to conduct systematic and unbiased discovery of GEPs which, in turn, have shed light on the mechanisms underlying many cellular processes 3.

In traditional RNA-Seq, measurements are limited to an average expression profile of potentially dozens of cell-types in a tissue. Any observed changes in gene expression could reflect induction of a program in some specific cell-type(s), an average of many different changes in multiple cell-types, or changes in overall cell-type composition. The development of scRNA-Seq avoids this problem by measuring the expression of thousands of individual cells simultaneously. This permits determination of the cell-types in the sample as well as changes to any of their gene expression profiles. Exploiting this new technology, large-scale projects such as the Tabula Muris and the Human Cell Atlas are seeking to identify and characterize all the cell-types in complex organisms in states of both health and disease 4,5.

Even with the ability to quantify e...