Large single-cell atlases are now routinely generated to serve as references for analysis of smaller-scale studies. Yet learning from reference data is complicated by batch effects between datasets, limited availability of computational resources and sharing restrictions on raw data. Here we introduce single-cell architectural surgery (scArches), a deep learning strategy for mapping query datasets on top of a reference. scArches uses transfer learning and parameter optimization to enable efficient, decentralized, iterative reference building and contextualization of new datasets with existing references without sharing raw data. Using examples from mouse brain, pancreas, immune and whole-organism atlases, we show that scArches preserves biological state information while removing batch effects, despite using four orders of magnitude fewer parameters than de novo integration. scArches generalizes to multimodal reference mapping, allowing imputation of missing modalities. Finally, scArches retains disease variation from coronavirus disease 2019 (COVID-19) when mapping to a healthy reference, enabling the discovery of disease-specific cell states. scArches will facilitate collaborative projects by enabling iterative construction, updating, sharing and efficient use of reference atlases.
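The idea of training far fewer parameters than de novo integration can be pictured with a toy example. The sketch below is not the scArches implementation. It assumes a linear conditional decoder whose trained weights stay frozen, adds one new weight column for the query batch, and optimizes only that column. The names `decode`, `fit_query_condition`, `decoder` and `batch_offset`, and all numbers, are hypothetical and chosen for illustration.

```python
import copy
import random

def decode(decoder, z, condition_col):
    """Linear conditional decoder: x_hat[g] = sum_k D[g][k] * z[k] + c[g]."""
    return [sum(w * zk for w, zk in zip(row, z)) + c
            for row, c in zip(decoder, condition_col)]

def fit_query_condition(decoder, latents, observed, lr=0.1, epochs=200):
    """Gradient descent on mean squared error, updating only the new column.

    `decoder` (the reference weights) is read but never modified.
    """
    new_col = [0.0] * len(decoder)  # newly added weights for the query batch
    for _ in range(epochs):
        grad = [0.0] * len(decoder)
        for z, x in zip(latents, observed):
            recon = decode(decoder, z, new_col)
            for g in range(len(decoder)):
                grad[g] += 2.0 * (recon[g] - x[g]) / len(latents)
        new_col = [w - lr * dg for w, dg in zip(new_col, grad)]
    return new_col

# Toy data (assumption): 3 genes, 2 latent dimensions, and a query batch
# that differs from the reference only by a constant per-gene offset.
random.seed(0)
decoder = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # frozen reference weights
decoder_before = copy.deepcopy(decoder)
batch_offset = [0.5, -1.0, 2.0]
latents = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
observed = [decode(decoder, z, batch_offset) for z in latents]

learned = fit_query_condition(decoder, latents, observed)
```

In this toy setting the learned column converges to the batch offset while the reference weights are untouched, so only three parameters are trained and no reference data is needed during the query update.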
The increasing generation of population-level single-cell atlases with hundreds or thousands of samples has the potential to link demographic and technical metadata with high-resolution cellular and tissue data in homeostasis and disease. Constructing such comprehensive references requires large-scale integration of heterogeneous cohorts with varying metadata capturing demographic and technical information. Here, we present single-cell population level integration (scPoli), a semi-supervised conditional deep generative model for data integration, label transfer and query-to-reference mapping. Unlike other models, scPoli learns both sample and cell representations, is aware of cell-type annotations and can integrate and annotate newly generated query datasets while providing an uncertainty mechanism to identify unknown populations. We extensively evaluated the method and showed its advantages over existing approaches. We applied scPoli to two population-level atlases of lung and peripheral blood mononuclear cells (PBMCs), the latter consisting of roughly 8 million cells across 2,375 samples. We demonstrate that scPoli allows atlas-level integration and automatic reference mapping with label transfer. It can explain sample-level biological and technical variations such as disease, anatomical location and assay by means of its novel sample embeddings. We use these embeddings to explore sample-level metadata, enable automatic sample classification and guide a data integration workflow. scPoli also enables simultaneous sample-level and cell-level analysis of gene expression patterns, revealing genes associated with batch effects and the main axes of between-sample variation. We envision scPoli becoming an important tool for population-level single-cell data integration, facilitating not only atlas use but also atlas interpretation through multi-scale analyses.
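Label transfer with an uncertainty flag for unknown populations can be illustrated with a small example. The abstract does not specify the mechanism, so the sketch below assumes a simple stand-in: each query cell receives the label of the nearest reference centroid in an embedding space, and the distance to that centroid serves as the uncertainty. The function names, the threshold `max_distance`, the embeddings and the cell-type labels are all hypothetical.

```python
import math

def centroids(embeddings, labels):
    """Average the reference embeddings of each cell type."""
    sums, counts = {}, {}
    for vec, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def transfer_label(query_vec, protos, max_distance):
    """Return (label, distance); cells far from every centroid are 'unknown'."""
    best_label, best_dist = min(
        ((lab, math.dist(query_vec, c)) for lab, c in protos.items()),
        key=lambda pair: pair[1],
    )
    if best_dist > max_distance:
        return "unknown", best_dist
    return best_label, best_dist

# Hypothetical 2-D reference embeddings with annotated cell types.
ref_embeddings = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
ref_labels = ["T cell", "T cell", "B cell", "B cell"]
protos = centroids(ref_embeddings, ref_labels)

near_label, near_dist = transfer_label([0.1, 0.0], protos, max_distance=1.0)
far_label, far_dist = transfer_label([10.0, -10.0], protos, max_distance=1.0)
```

Here the first query cell lies close to the "T cell" centroid and inherits that label, while the second lies far from both centroids and is flagged as unknown. In practice the threshold would need to be calibrated on held-out data.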