Convoluted biological processes underlie the development of multicellular organisms and diseases. Advances in scRNA-seq make it possible to study these processes from cells at various developmental stages. Achieving accurate characterization is challenging, however, particularly for periodic processes such as the cell cycle. To address this, we developed Cyclum, a novel autoencoder approach that characterizes circular trajectories in the high-dimensional gene expression space. Cyclum substantially improves the accuracy and robustness of cell-cycle characterization beyond existing approaches. Applying Cyclum to remove cell-cycle effects leads to substantially improved delineation of cell subpopulations, which is useful for establishing various cell atlases and studying tumor heterogeneity. Cyclum is available at https://github.com/KChen-lab/cyclum.
Background

Convoluted biological processes underlie the development of multicellular organisms; these processes involve cell proliferation, differentiation, state transition, and cell-to-cell communication [1,2]. The course of development can be influenced by genetic (e.g., mutations), epigenetic, and environmental factors. Alterations to the genome, transcriptome, and proteome of individual cells can also result in pathogenesis [3]. Early efforts were made to reconstruct the temporal ordering of biological samples using bulk data [4,5], although cellular heterogeneity makes it difficult to infer accurate time series from such data. Advances in single-cell RNA sequencing (scRNA-seq) have enabled large-scale acquisition of single-cell transcriptomic profiles and provided an unprecedented opportunity to uncover the latent biological processes that orchestrate the dynamic expression of genes in single cells throughout the course of development [6]. However, it is very challenging to deconvolute these processes accurately from scRNA-seq data. A sufficiently large number of cells across time, lineage, and space must be sampled in order to capture detailed subpopulation features and reduce technical noise. Tremendous efforts have been made to develop trajectory inference methods for scRNA-seq data. Over 59 methods have been developed since 2014 [7], including the widely known Monocle and Wanderlust. These methods represent biological processes in linear, bifurcating, or other graph topologies.

In many developmental processes, such as embryogenesis, organogenesis, and tumorigenesis [8], the cell cycle plays a fundamental role. Distinct from processes that evoke linear changes in gene expression, the cell cycle causes periodicity. A cycle starts from the G1 phase, goes through S and G2/M, and then returns to G1 within 24 hours for human cells [2]. This process is orchestrated elegantly by variable sets of genes (e.g., cyclins and cyclin-dependent kinases) that are turned on and off at relatively precise timings.
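To make the notion of periodicity concrete, the following toy sketch models each gene as a cosine wave that peaks once per cycle. It is an illustration only and is not the Cyclum model: the gene names, peak times, and the cosine form are assumptions introduced here, and the 24-hour period is taken from the text above.

```python
import math

# Cycle length taken from the text (within 24 hours for human cells).
PERIOD_HOURS = 24.0

# Hypothetical peak times (hours into the cycle) for three illustrative
# genes; these are assumptions for the sketch, not real measurements.
PEAKS = {"gene_G1": 4.0, "gene_S": 12.0, "gene_G2M": 20.0}


def periodic_expression(t_hours, peak_hours, baseline=1.0, amplitude=1.0):
    """Toy expression level of a gene that peaks once per cycle."""
    phase = 2.0 * math.pi * (t_hours - peak_hours) / PERIOD_HOURS
    return baseline + amplitude * math.cos(phase)


def profile(t_hours):
    """Expression vector of all toy genes for a cell at time t_hours."""
    return [periodic_expression(t_hours, peak) for peak in PEAKS.values()]


def distance(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


start = profile(0.0)   # beginning of the cycle
half = profile(12.0)   # halfway through the cycle
full = profile(24.0)   # one full cycle later

# A cell halfway through the cycle is far from the start in expression
# space, while a cell one full cycle later coincides with it again.
print("start vs half cycle:", distance(start, half))
print("start vs full cycle:", distance(start, full))
```

In this toy setting, a cell's expression profile departs from its starting point and then returns to it after one full period, so it traces a closed loop in expression space rather than an open, linear path.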
As a result of such periodicity, the cycling cells at different transcriptomic states form a circular, non-linear trajectory in high-dimensional gene expression spaces. The positions of a cell alongside the circular trajec...