Recent advances in single-cell technologies, including single-cell chromatin accessibility sequencing (scCAS), have enabled profiling of the epigenetic landscapes of thousands of individual cells. However, the characteristics of scCAS data, including high dimensionality, a high degree of sparsity and high technical variation, make computational analysis challenging. A reference-guided approach, which uses the information in existing datasets, may facilitate the analysis of scCAS data. We present RA3 (Reference-guided Approach for the Analysis of single-cell chromatin Accessibility data), which uses the information in massive existing bulk chromatin accessibility and annotated scCAS data. RA3 simultaneously models 1) the shared biological variation among scCAS data and the reference data, and 2) the unique biological variation in scCAS data that identifies distinct subpopulations. We show that RA3 achieves superior performance on many scCAS datasets. We also present several approaches for constructing the reference data to demonstrate the wide applicability of RA3.
Chromatin accessibility is a measure of the physical access of nuclear macromolecules to DNA and is essential for understanding the regulatory mechanism [1, 2]. For rapid and sensitive probing of chromatin accessibility, the assay for transposase-accessible chromatin using sequencing (ATAC-seq) directly inserts sequencing adaptors into accessible chromatin regions using hyperactive Tn5 transposase in vitro [3]. With recent advances in technology, single-cell chromatin accessibility sequencing (scCAS) further enables the investigation of the epigenomic landscape in individual cells [4, 5]. However, the analysis of scCAS data is challenging because of its high dimensionality and high degree of sparsity: the low copy number of DNA (two copies in a diploid genome) leads to a capture rate of only 1-10% for the hundreds of thousands of possible accessible peaks [6]. The proposed approaches for the analysis of single-cell Seq) data thus present limitations due to the novelty and assay-specific challenges of extreme sparsity and tens of times higher dimensionality [6].
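To give a rough sense of the sparsity described above, the following minimal sketch simulates a binary cell-by-peak matrix and reports the fraction of nonzero entries. It is a toy illustration only: the matrix sizes, the 5% capture rate (chosen from within the 1-10% range quoted above), and the assumption that every peak is captured independently with the same probability are all hypothetical simplifications, not properties of any real scCAS dataset or of RA3.

```python
import random


def simulate_cell_by_peak(n_cells, n_peaks, capture_rate, seed=0):
    """Return one set per cell holding the indices of peaks observed as accessible.

    Assumption (toy model): every peak is captured independently with the
    same probability `capture_rate` in every cell.
    """
    rng = random.Random(seed)
    return [
        {j for j in range(n_peaks) if rng.random() < capture_rate}
        for _ in range(n_cells)
    ]


def density(cells, n_peaks):
    """Fraction of nonzero entries in the implied binary cell-by-peak matrix."""
    nonzero = sum(len(peaks) for peaks in cells)
    return nonzero / (len(cells) * n_peaks)


# Hypothetical sizes, scaled down from the "hundreds of thousands" of peaks
# mentioned in the text so that the example runs quickly.
N_CELLS, N_PEAKS, CAPTURE_RATE = 200, 5000, 0.05

cells = simulate_cell_by_peak(N_CELLS, N_PEAKS, CAPTURE_RATE)
d = density(cells, N_PEAKS)
print(f"nonzero fraction: {d:.3f}; zero fraction: {1 - d:.1%}")
```

Storing only the indices of observed peaks per cell, as done here with sets, mirrors the usual choice of a sparse representation for such data, since the large majority of entries are zero.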
Several computational algorithms have been proposed to analyze scCAS data. chromVAR assesses the variation of chromatin accessibility using groups of peaks that share the same functional annotations [7]. scABC calculates weights of cells based on the number of distinct reads within the peak background and then uses weighted k-medoids to cluster the cells [8]. cisTopic applies a latent Dirichlet allocation (LDA) model to explore cis-regulatory regions and characterizes cell heterogeneity from the generated regions-by-topics and topics-by-cells probability matrices [9]. Cusanovich et al. proposed a method that performs the term frequency-inverse document frequency (TF-IDF) transformation and singular value decomposition (SVD) iteratively to obtain the final feature matrix [5, 10]. Scasat uses Jaccard distance to evaluate the dissimilarity o...