Recent advances in data acquisition technologies in biology have led to major challenges in mining relevant information from large datasets. For example, single-cell RNA sequencing technologies now produce expression and sequence information from tens of thousands of cells in a single experiment. A common task in analyzing biological data is to cluster samples or features (e.g. genes) into groups sharing common characteristics. This is an NP-hard problem for which numerous heuristic algorithms have been developed. However, in many cases, the clusters created by these algorithms do not reflect biological reality. To overcome this, a Networks Based Clustering (NBC) approach was recently proposed, by which the samples or genes in the dataset are first mapped to a network and then community detection (CD) algorithms are used to identify clusters of nodes. Here, we created an open and flexible Python-based toolkit for NBC that enables easy and accessible network construction and community detection. We then tested the applicability of NBC for identifying clusters of cells or genes from previously published large-scale single-cell and bulk RNA-seq datasets. We show that NBC can be used to accurately and efficiently analyze large-scale datasets of RNA sequencing experiments.
Introduction
Advances in high-throughput genomic technologies have revolutionized the way biological data is acquired. Technologies like DNA sequencing (DNA-seq), RNA sequencing (RNA-seq), chromatin immunoprecipitation sequencing (ChIP-seq), and mass cytometry are becoming standard components of modern biological research. The majority of these datasets are publicly available for further large-scale studies. Notable examples include the Genotype-Tissue Expression (GTEx) project 1, The Cancer Genome Atlas (TCGA) 2, and the 1000 Genomes Project 3. Examples of utilizing these datasets include studying allele-specific expression across tissues 4, 5, characterizing functional variation in the human genome 6, finding patterns of transcriptome variation across individuals and tissues 7, and characterizing the global mutational landscape of cancer 8. Moreover, some of these genomic technologies have recently been adapted to work at the single-cell level 9. While pioneering single-cell RNA sequencing (scRNA-seq) studies were able to process only relatively small numbers of cells (42 cells in 10 and 18 cells in 11), recent scRNA-seq studies that take advantage of automation and nanotechnology have produced expression and sequence data from many thousands of individual cells (∼1,500 cells in 12 and ∼40,000 cells in 13). Hence, biology is facing significant challenges in handling and analyzing large, complex datasets 14, 15.
Clustering analysis
One of the common methods used for making sense of large biological datasets is cluster analysis: the task of grouping similar samples or features 16. For example, clustering analysis has been used to identify subtypes of breast tumors 17, 18, with implications for treatment and prognosis. More r...