2020
DOI: 10.1101/2020.05.27.119438
Preprint

mbkmeans: fast clustering for single cell data using mini-batch k-means

Abstract: Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded ent…

Cited by 2 publications (4 citation statements)
References 38 publications
“…The percent of cell counts mapping to mtDNA genes in this representation is shown in (Figure 5A). Using the top 50 principal components, we performed unsupervised clustering using the mini-batch k-means (mbkmeans) algorithm [43] implemented in the mbkmeans [44] R/Bioconductor package for unsupervised clustering to identify cell types, which is a scalable version of the widely-used k-means algorithm [45–47] (Figure 5B). The number of clusters (k=6) was determined using an elbow plot with the sum of squared errors (Figure S2).…”
Section: Results (mentioning)
confidence: 99%
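The citation statement above describes two steps: clustering with mini-batch k-means and choosing k from an elbow plot of the sum of squared errors (SSE). The following is a minimal pure-Python sketch of that idea, not the mbkmeans R/Bioconductor API. The function names (`mini_batch_kmeans`, `sse`), the batch size, the iteration count, and the synthetic 2-D data are all assumptions made for illustration; the cited analysis ran on the top 50 principal components of real expression data.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def mini_batch_kmeans(points, k, batch_size=32, n_iter=100, seed=0):
    """Hypothetical helper: mini-batch k-means with per-center learning rates.

    Each iteration looks at a small random batch only, so the full
    dataset never has to be processed in a single pass.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # random initial centers
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch first, with the centers held fixed.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # step size shrinks as the center sees more points
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points)

# Synthetic stand-in for a reduced representation: three well-separated groups.
data_rng = random.Random(1)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    [cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5)]
    for cx, cy in true_centers
    for _ in range(100)
]

# SSE for each candidate k; plotting these values against k gives the elbow plot.
sse_by_k = {k: sse(points, mini_batch_kmeans(points, k)) for k in range(1, 6)}
for k, value in sse_by_k.items():
    print(k, round(value, 1))
```

In an elbow plot, the chosen k is typically the point after which adding clusters no longer reduces the SSE by much. Because each run uses random batches and a random initialization, the SSE values are not guaranteed to decrease strictly with k.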
“…To represent the effect of miQC on downstream analyses, we calculated and plotted the Uniform Manifold Approximation and Projection (UMAP) representation of the single-cell expression data using functions in the scater package. We chose to highlight how miQC filtering specifically affects clustering results using the mbkmeans package, which uses mini-batches to quickly and scalably produce k-means clustering assignments [44]. We ran mbkmeans on a reduced representation of our expression data, the first 50 principal components as calculated via scater.…”
Section: Methods (mentioning)
confidence: 99%
“…The massive number of cells, combined with the large number of genes make even simple scaling normalization demanding. For instance, scran applied to a dataset of 1.3 million datasets take more than 5 hours (Hicks et al, 2021).…”
Section: Discussion (mentioning)
confidence: 99%
“…The increase in the number of cells per experiment translates into a dramatic increase in the data points to be analyzed, requiring methods able to efficiently scale to millions of cells, both in terms of memory usage and computational time. Typically, each step of the analysis, from normalization to clustering and functional analyses, can be highly demanding when dealing with hundreds of thousands or even millions of cells (Hicks et al, 2021; Lähnemann et al, 2020). In this perspective, a desirable normalization method should be able to scale efficiently with the number of cells, while simultaneously maintaining a good performance.…”
Section: Introduction (mentioning)
confidence: 99%