Advances in single-cell RNA sequencing over the past decade have shifted the discussion of cell identity towards the transcriptional state of the cell. While the incredible resolution provided by single-cell RNA sequencing has led to great advances in unravelling tissue heterogeneity and inferring cell differentiation dynamics, it raises the question of which sources of variation are important for determining cellular identity. Here we show that confounding biological sources of variation, most notably the cell cycle, can distort the inference of differentiation trajectories. We show that by factorizing single-cell data into distinct sources of variation, we can select a relevant set of factors that constitute the core regulators for trajectory inference, while filtering out confounding sources of variation (e.g. cell cycle) that can perturb the inferred trajectory. Scripts are publicly available at https://github.com/mochar/cell_variation.

single cell | pseudotime | factor analysis | trajectory inference

Single-cell RNA sequencing enables quantitative gene expression profiling of individual cells. From an RNA viewpoint, these cells live in a high-dimensional space defined by the expression of their genes. A critical step when analyzing such data is the identification of cells in order to find and label the cell types present in the data. This is often achieved by grouping together cells with similar expression profiles by applying a clustering method. The resulting cell clusters are thus separated from one another by a set of genes uniquely expressed or silenced in a subset of clusters. These so-called marker genes can then be used for identification by cross-referencing with known marker genes or marker genes found in other studies.
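The clustering-and-marker workflow described above can be illustrated with a minimal sketch. It is not the pipeline used in this work: the simulated data, the two-cluster k-means variant with farthest-point initialization, and all function names (`simulate_cells`, `cluster_two_groups`, `top_marker_per_cluster`) are assumptions made purely for illustration, and real analyses would typically rely on dedicated single-cell toolkits.

```python
import random
from statistics import mean


def simulate_cells(n_per_type=30, n_genes=6, seed=0):
    """Toy log-expression profiles for two hypothetical cell types.

    Type 0 over-expresses gene 0 and type 1 over-expresses gene 1;
    the remaining genes carry only noise.
    """
    rng = random.Random(seed)
    cells, labels = [], []
    for cell_type in (0, 1):
        for _ in range(n_per_type):
            profile = [rng.gauss(1.0, 0.3) for _ in range(n_genes)]
            profile[cell_type] += 4.0  # assumed marker gene for this type
            cells.append(profile)
            labels.append(cell_type)
    return cells, labels


def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_two_groups(cells, n_iter=10):
    """Two-cluster k-means with a deterministic farthest-point start."""
    first = cells[0]
    second = max(cells, key=lambda c: sq_dist(c, first))
    centroids = [first, second]
    assignment = []
    for _ in range(n_iter):
        # Assign every cell to its nearest centroid.
        assignment = [
            min((0, 1), key=lambda k: sq_dist(cell, centroids[k]))
            for cell in cells
        ]
        # Move each centroid to the mean profile of its members.
        for k in (0, 1):
            members = [c for c, a in zip(cells, assignment) if a == k]
            if members:
                centroids[k] = [mean(col) for col in zip(*members)]
    return assignment


def top_marker_per_cluster(cells, assignment):
    """For each cluster, return the gene with the largest mean
    expression difference between cells inside and outside it."""
    n_genes = len(cells[0])
    markers = {}
    for k in sorted(set(assignment)):
        inside = [c for c, a in zip(cells, assignment) if a == k]
        outside = [c for c, a in zip(cells, assignment) if a != k]
        diffs = [
            mean(c[g] for c in inside) - mean(c[g] for c in outside)
            for g in range(n_genes)
        ]
        markers[k] = max(range(n_genes), key=lambda g: diffs[g])
    return markers


cells, labels = simulate_cells()
assignment = cluster_two_groups(cells)
markers = top_marker_per_cluster(cells, assignment)
print(markers)  # maps cluster index -> index of its top marker gene
```

In this toy setting the clusters coincide with the simulated cell types because the marker genes dominate the cell-cell distances; the same procedure degrades once another source of variation contributes comparably to those distances.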
This clustering-based approach for cell identification relies on the general presumption that the measured expression levels are reflective of the cell's identity, which may be violated due to shared transcriptional programs between two or more cell types. Large variations within cell type clusters due to many exclusive programs may also pose a problem, as it can become hard to distinguish between cell types and cell states (1). More generally, sources of variation that contribute significantly to the cell-cell distances in gene space, yet do not reflect the cell type, can be detrimental to the identification task. These can range from small transient changes, e.g. cell communication, to complex shifts in the cell's regulatory state such as the cell cycle, which has been reported to contribute a substantial portion of the gene expression variance (2). Moreover, cell identification is often preceded by a gene filtering step whereby genes with low variance (and therefore little information) are discarded to ease the computational burden in downstream analysis. Depending on the normalization and filtering criteria used, gene filtering can lead to a lower dimensional space that f...