mbkmeans: fast clustering for single cell data using mini-batch k-means

Hicks, Stephanie C.; Liu, Ruoxi; Ni, Yuwei; Purdom, Elizabeth; Risso, Davide

doi:10.1101/2020.05.27.119438

Cited by 2 publications

(4 citation statements)

References 38 publications

“…The percent of cell counts mapping to mtDNA genes in this representation is shown in (Figure 5A). Using the top 50 principal components, we performed unsupervised clustering using the mini-batch k-means (mbkmeans) algorithm [43] implemented in the mbkmeans [44] R/Bioconductor package for unsupervised clustering to identify cell types, which is a scalable version of the widely-used k-means algorithm [45–47] (Figure 5B). The number of clusters (k = 6) was determined using an elbow plot with the sum of squared errors (Figure S2).…”

Section: Results (mentioning)

confidence: 99%
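The clustering step described in the statement above (mini-batch k-means on a reduced representation, with k chosen from an elbow plot of the sum of squared errors) can be sketched as follows. This is a minimal pure-Python illustration, not the mbkmeans R/Bioconductor implementation that the citing paper used. The function names, the batch size, the iteration count, and the synthetic two-dimensional data standing in for the top 50 principal components are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def mini_batch_kmeans(points, k, batch_size=32, n_iter=100, seed=0):
    """Mini-batch k-means with per-center learning rates (hypothetical helper)."""
    rng = random.Random(seed)
    # Initialise centers from k randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_iter):
        # Each iteration only looks at a small random subset of the data.
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch before moving any center.
        labels = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as a center sees more points
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points)

# Synthetic stand-in for a PCA-reduced expression matrix: three 2-D blobs.
data_rng = random.Random(1)
blob_means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [[mx + data_rng.gauss(0, 0.5), my + data_rng.gauss(0, 0.5)]
        for mx, my in blob_means for _ in range(100)]

# Elbow curve: SSE for a range of k; the "elbow" is where the drop levels off.
elbow = {k: sse(data, mini_batch_kmeans(data, k)) for k in range(1, 7)}
for k, err in elbow.items():
    print(k, round(err, 1))
```

Because each update touches only one batch, memory use and per-iteration cost depend on the batch size rather than on the total number of cells, which is the property the citing authors refer to as scalability. The final SSE pass here still scans all points; it is kept simple for illustration.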

“…To represent the effect of miQC on downstream analyses, we calculated and plotted the Uniform Manifold Approximation and Projection (UMAP) representation of the single-cell expression data using functions in the scater package. We chose to highlight how miQC filtering specifically affects clustering results using the mbkmeans package, which uses mini-batches to quickly and scalably produce k-means clustering assignments [44]. We ran mbkmeans on a reduced representation of our expression data, the first 50 principal components as calculated via scater.…”

Section: Methods (mentioning)

confidence: 99%

miQC: An adaptive probabilistic framework for quality control of single-cell RNA-sequencing data

Hippen, Falco, Weber, et al. 2021

Preprint

Self Cite

Motivation: Single-cell RNA-sequencing has made it possible to profile gene expression in tissues at high resolution. An important preprocessing step prior to performing downstream analyses is to identify and remove cells with poor or degraded sample quality using quality control (QC) metrics. Two widely used QC metrics to identify a 'low-quality' cell are (i) if the cell includes a high proportion of reads that map to mitochondrial DNA (mtDNA) encoded genes and (ii) if a small number of genes are detected. Current best practices use these QC metrics independently with either arbitrary, uniform thresholds (e.g. 5%) or biological context-dependent (e.g. species) thresholds, and fail to jointly model these metrics in a data-driven manner. Current practices are often overly stringent and especially untenable on lower-quality tissues, such as archived tumor tissues. Results: We propose a data-driven QC metric (miQC) that jointly models both the proportion of reads mapping to mtDNA genes and the number of detected genes with mixture models in a probabilistic framework to predict the low-quality cells in a given dataset. We demonstrate how our QC metric easily adapts to different types of single-cell datasets to remove low-quality cells while preserving high-quality cells that can be used for downstream analyses. Availability: Software available at https://github.com/greenelab/miQC. The code used to download datasets, perform the analyses, and reproduce the figures is available at https://github.com/greenelab/mito-filtering.

“…The massive number of cells, combined with the large number of genes make even simple scaling normalization demanding. For instance, scran applied to a dataset of 1.3 million [cells] takes more than 5 hours (Hicks et al, 2021).…”

Section: Discussion (mentioning)

confidence: 99%

“…The increase in the number of cells per experiment translates into a dramatic increase in the data points to be analyzed, requiring methods able to efficiently scale to millions of cells, both in terms of memory usage and computational time. Typically, each step of the analysis, from normalization to clustering and functional analyses, can be highly demanding when dealing with hundreds of thousands or even millions of cells (Hicks et al, 2021;Lähnemann et al, 2020). In this perspective, a desirable normalization method should be able to scale efficiently with the number of cells, while simultaneously maintaining a good performance.…”

Section: Introduction (mentioning)

confidence: 99%

PsiNorm: a scalable normalization for single-cell RNA-seq data

Borella, Martello, Risso, et al. 2021

Preprint

Self Cite

Single-cell RNA sequencing (scRNA-seq) enables transcriptome-wide gene expression measurements at single-cell resolution providing a comprehensive view of the compositions and dynamics of tissue and organism development. The evolution of scRNA-seq protocols has led to a dramatic increase of cells throughput, exacerbating many of the computational and statistical issues that previously arose for bulk sequencing. In particular, with scRNA-seq data all the analyses steps, including normalization, have become computationally intensive, both in terms of memory usage and computational time. In this perspective, new accurate methods able to scale efficiently are desirable. Here we propose PsiNorm, a between-sample normalization method based on the power-law Pareto distribution parameter estimate. Here we show that the Pareto distribution well resembles scRNA-seq data, independently of sequencing depths and technology. Motivated by this result, we implement PsiNorm, a simple and highly scalable normalization method. We benchmark PsiNorm with other seven methods in terms of cluster identification, concordance and computational resources required. We demonstrate that PsiNorm is among the top performing methods showing a good trade-off between accuracy and scalability. Moreover PsiNorm does not need a reference, a characteristic that makes it useful in supervised classification settings, in which new out-of-sample data need to be normalized. PsiNorm is available as an R package available at https://github.com/MatteoBlla/PsiNorm
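The abstract above does not spell out the estimator, so the following is only a rough sketch of what a normalization "based on the power-law Pareto distribution parameter estimate" could look like. It uses the standard maximum-likelihood estimate of the Pareto shape parameter, alpha = n / sum(log(x_i / x_min)). The pseudocount of 1, the function names, and the final step of multiplying each cell's counts by its estimated shape are assumptions made for illustration and are not confirmed by the abstract.

```python
import math

def pareto_shape_mle(values):
    """Maximum-likelihood Pareto shape, with the scale fixed at min(values)."""
    x_min = min(values)
    if x_min <= 0:
        raise ValueError("Pareto values must be strictly positive")
    log_sum = sum(math.log(v / x_min) for v in values)
    if log_sum == 0:
        raise ValueError("all values are equal; the shape is not identifiable")
    return len(values) / log_sum

def shape_per_cell(counts, pseudocount=1):
    """One shape estimate per cell; `counts` is a list of per-cell gene counts.

    The pseudocount (an assumption) keeps zero counts inside the Pareto support.
    """
    return [pareto_shape_mle([c + pseudocount for c in cell]) for cell in counts]

def scale_by_shape(counts, pseudocount=1):
    """Assumed normalization step: multiply each cell's counts by its shape."""
    alphas = shape_per_cell(counts, pseudocount)
    return [[c * a for c in cell] for cell, a in zip(counts, alphas)]

# Two toy cells with the same expression pattern at different depths.
shallow = [0, 1, 3]
deep = [0, 10, 30]
alphas = shape_per_cell([shallow, deep])
scaled = scale_by_shape([shallow, deep])
print(alphas)
print(scaled)
```

In this toy example the more deeply sequenced cell gets the smaller shape estimate, so multiplying by it shrinks that cell's counts more. Each estimate needs only one pass over a cell's counts and no reference sample, which is consistent with the scalability and reference-free claims in the abstract.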
