Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single-cell clustering pipelines. Existing feature selection methods perform inconsistently across datasets, occasionally even resulting in poorer clustering accuracy than without feature selection. Moreover, existing methods ignore information contained in gene-gene correlations. Here, we introduce DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. Additionally, DUBStepR was the only method to robustly deconvolve T and NK heterogeneity by identifying disease-associated common and rare cell types and subtypes in PBMCs from rheumatoid arthritis patients. DUBStepR is scalable to over a million cells, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.
Feature selection (marker gene selection) is widely believed to improve clustering accuracy, and is thus a key component of single-cell clustering pipelines. However, we found that the performance of existing feature selection methods was inconsistent across benchmark datasets, and occasionally even worse than without feature selection. Moreover, existing methods ignored information contained in gene-gene correlations. We therefore developed DUBStepR (Determining the Underlying Basis using Stepwise Regression), a feature selection algorithm that leverages gene-gene correlations with a novel measure of inhomogeneity in feature space, termed the Density Index (DI). Despite selecting a relatively small number of genes, DUBStepR substantially outperformed existing single-cell feature selection methods across diverse clustering benchmarks. In a published scRNA-seq dataset from sorted monocytes, DUBStepR sensitively detected a rare and previously invisible population of contaminating basophils. DUBStepR is scalable to large datasets, and can be straightforwardly applied to other data types such as single-cell ATAC-seq. We propose DUBStepR as a general-purpose feature selection solution for accurately clustering single-cell data.
Tissue fibrosis affects multiple organs and involves a master-regulatory role of macrophages, which respond to an initial inflammatory insult common to all forms of fibrosis. The recently unraveled multiorgan heterogeneity of macrophages in healthy and fibrotic human tissues suggests that tissue-resident macrophages expressing osteopontin (SPP1) associate with lung and liver fibrosis. However, the conservation of this SPP1+ macrophage population across different tissues, and its specificity to fibrotic diseases with different etiologies, remain unclear. Integrating 13 single-cell RNA-sequencing datasets to profile 225,985 tissue macrophages from healthy and fibrotic heart, lung, liver, kidney, skin and endometrium, we extended the association of SPP1+ macrophages with fibrosis to all these tissues. We also identified a subpopulation expressing matrisome-associated genes (e.g., matrix metalloproteinases and their tissue inhibitors), functionally enriched for ECM remodeling and cell metabolism, representative of a matrisome-associated macrophage (MAM) polarization state within SPP1+ macrophages. Importantly, the MAM polarization state follows a differentiation trajectory from SPP1+ macrophages, which was conserved across all fibrotic tissues and driven by NFATC1 and HIVEP3 regulons. Unlike SPP1+ macrophages, the MAM polarization state shows a positive association with ageing in mice and humans, and across multiple tissues during homeostasis. These results suggest an advanced, age-dependent polarization state of SPP1+ macrophages in fibrotic tissues as a result of prolonged inflammatory cues within each tissue microenvironment.