Scedar: A scalable Python package for single-cell RNA-seq exploratory data analysis

Zhang, Yuanchao; Kim, Man S.; Reichenberger, Erin R.; Stear, Benjamin; Taylor, Deanne

doi:10.1371/journal.pcbi.1007794

Cited by 12 publications

(5 citation statements)

References 72 publications

“…A key analytical need for the PCA, analyses of longitudinal datasets, would be driven by PCA researcher needs. Computational methods with the ability to cluster and visualize cellular heterogeneity across millions of cells have been recently introduced (Cho et al, 2018, Wolf et al, 2018, Zhang and Taylor, 2018), which could be adapted for longitudinal data, for instance using machine learning approaches (Hu and Greene, 2018, van Dijk et al, 2018, Lin et al, 2017, Amodio et al, 2017, Schiebinger et al, 2017, Schiebinger et al, 2019) to help map cell lineages and trajectories particular to pediatric development.…”

Section: Expected Challenges and Timelines (mentioning)

confidence: 99%

The Pediatric Cell Atlas: Defining the Growth Phase of Human Development at Single-Cell Resolution

et al. 2019

Self Cite

Single-cell gene expression analyses of mammalian tissues have uncovered profound stage-specific molecular regulatory phenomena that have changed the understanding of unique cell types and signaling pathways critical for lineage determination, morphogenesis, and growth. We discuss here the case for a Pediatric Cell Atlas as part of the Human Cell Atlas consortium to provide single-cell profiles and spatial characterization of gene expression across human tissues and organs. Such data will complement adult and developmentally focused HCA projects to provide a rich cytogenomic framework for understanding not only pediatric health and disease but also environmental and genetic impacts across the human lifespan.

“…An extensive empirical study has been carried out by comparing the performance of contrastive-sc with 11 alternative techniques, representing both methods requiring or not the number of clusters as input. ScziDesk [19], scDeepClustering [16], scRNA [11], cidr [8] and soup [12] take as input the expected number of clusters while Seurat [13] (scanpy [14] implementation), desc [18], scedar [32], raceid [33] and scvi [20] perform clustering without any alternative information. Additionally, a naive baseline method consisting of clustering with KMeans the first 2 principal components of the expression matrix has been assessed.…”

Section: Competing Methods (mentioning)

confidence: 99%

Contrastive self-supervised clustering of scRNA-seq data

Ciortan, Defrance. 2021. BMC Bioinformatics

Background: Single-cell RNA sequencing (scRNA-seq) has emerged as a main strategy to study transcriptional activity at the cellular level. Clustering analysis is routinely performed on scRNA-seq data to explore, recognize or discover underlying cell identities. The high dimensionality of scRNA-seq data and its significant sparsity accentuated by frequent dropout events, introducing false zero count observations, make the clustering analysis computationally challenging. Even though multiple scRNA-seq clustering techniques have been proposed, there is no consensus on the best performing approach. On a parallel research track, self-supervised contrastive learning recently achieved state-of-the-art results on image clustering and, subsequently, image classification. Results: We propose contrastive-sc, a new unsupervised learning method for scRNA-seq data that performs cell clustering. The method consists of two consecutive phases: first, an artificial neural network learns an embedding for each cell through a representation training phase. The embedding is then clustered in the second phase with a general clustering algorithm (i.e. KMeans or Leiden community detection). The proposed representation training phase is a new adaptation of the self-supervised contrastive learning framework, initially proposed for image processing, to scRNA-seq data. contrastive-sc has been compared with ten state-of-the-art techniques. A broad experimental study has been conducted on both simulated and real-world datasets, assessing multiple external and internal clustering performance metrics (i.e. ARI, NMI, Silhouette, Calinski scores). Our experimental analysis shows that contrastive-sc compares favorably with state-of-the-art methods on both simulated and real-world datasets. Conclusion: On average, our method identifies well-defined clusters in close agreement with ground truth annotations. Our method is computationally efficient, being fast to train and having a limited memory footprint. contrastive-sc maintains good performance when only a fraction of input cells is provided and is robust to changes in hyperparameters or network architecture. The decoupling between the creation of the embedding and the clustering phase allows the flexibility to choose a suitable clustering algorithm (i.e. KMeans when the number of expected clusters is known, Leiden otherwise) or to integrate the embedding with other existing techniques.
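
The citation statement for this paper mentions a naive baseline: KMeans applied to the first 2 principal components of the expression matrix. A minimal sketch of that idea follows, assuming a dense cells-by-genes matrix and using only NumPy (PCA via SVD, plain Lloyd iterations). The function name `pca_kmeans_baseline`, its parameters, and the random initialization are illustrative assumptions, not the code used by the cited authors or by scedar.

```python
import numpy as np


def pca_kmeans_baseline(X, n_clusters, n_components=2, n_iter=100, seed=0):
    """Cluster rows of X (cells) with KMeans on the leading principal components.

    Hypothetical helper for illustration only; not taken from any cited package.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # PCA: center each gene (column), then project cells onto the top components.
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    Z = U[:, :n_components] * S[:n_components]

    # KMeans (Lloyd's algorithm), initialized from randomly chosen cells.
    centers = Z[rng.choice(len(Z), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every cell to every center.
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            Z[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, Z


# Toy usage: two simulated cell groups that differ in the first 10 genes.
rng = np.random.default_rng(1)
group_a = rng.poisson(1.0, size=(30, 50))
group_b = rng.poisson(1.0, size=(30, 50))
group_b[:, :10] = rng.poisson(20.0, size=(30, 10))
labels, embedding = pca_kmeans_baseline(np.vstack([group_a, group_b]), n_clusters=2)
```

Like any KMeans variant, this sketch needs the number of clusters as input, which is the distinction the citation statement draws between the two families of compared methods.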

“…Our experimental setup compared the performance of graph-sc with 12 competing methods, representative of both scenarios. ScziDesk (Chen et al, 2020), scDeepClustering (Tian et al, 2019), scRNA (Mieth et al, 2019), cidr (Lin et al, 2017) and soup (Zhu et al, 2019) take as input the expected number of clusters while scGNN (Wang et al, 2021), Seurat (Satija et al, 2015; scanpy (Wolf et al, 2018) implementation), desc (Li et al, 2020), scedar (Zhang et al, 2020), raceid (Muraro et al, 2016) and scvi (Lopez et al, 2018) perform clustering without any alternative information. In addition, 6 naive baselines (depicted in gray in all our plots) consisting of clustering with K-means the following dimensionality reduced version of the expression matrix were assessed: the first 2 (labeled pca2_kmeans) and 50 (labelled pca50_kmeans) principal components of X, the first 20 (umap20_kmeans) or 50 (umap50_kmeans) UMAP, the first 2 UMAP components of the 50 PCA (pca50_umap_kmeans) of X and with Leiden the best performing baseline, the 2 UMAP components of the 50 PCA of X (labelled pca50_umap_leiden).…”

Section: Competing Methods (mentioning)

confidence: 99%

GNN-based embedding for clustering scRNA-seq data

Ciortan, Defrance. 2021. Bioinformatics

Motivation: Single-cell RNA sequencing (scRNA-seq) provides transcriptomic profiling for individual cells, allowing researchers to study the heterogeneity of tissues, recognize rare cell identities and discover new cellular subtypes. Clustering analysis is usually used to predict cell class assignments and infer cell identities. However, the high sparsity of scRNA-seq data, accentuated by dropout events, generates challenges that have motivated the development of numerous dedicated clustering methods. Nevertheless, there is still no consensus on the best performing method. Results: graph-sc is a new method leveraging a graph autoencoder network to create embeddings for scRNA-seq cell data. While this work analyzes the performance of clustering the embeddings with various clustering algorithms, other downstream tasks can also be performed. A broad experimental study has been performed on both simulated and scRNA-seq datasets. The results indicate that although there is no consistently best method across all the analyzed datasets, graph-sc compares favorably to competing techniques across all types of datasets. Furthermore, the proposed method is stable across consecutive runs, robust to input down-sampling, generally insensitive to changes in the network architecture or training parameters and more computationally efficient than other competing methods based on neural networks. Modeling the data as a graph provides increased flexibility to define custom features characterizing the genes, the cells and their interactions. Moreover, external data (e.g. gene network) can easily be integrated into the graph and used seamlessly under the same optimization task. Availability and implementation: https://github.com/ciortanmadalina/graph-sc. Supplementary information: Supplementary data are available at Bioinformatics online.
