Background: Many tumors contain more than one subclone, a phenomenon known as intra-tumor heterogeneity (ITH). Characterizing ITH is essential for designing treatment plans, for prognosis, and for studying cancer progression. Single-cell DNA sequencing (scDNAseq) has proven effective in deciphering ITH. The cells of each subclone are expected to carry a unique set of mutations, such as single nucleotide variations (SNVs). While many methods have been proposed for reconstructing cancer evolutionary trees, few characterize subclonality directly without tree reconstruction. Tree reconstruction is important for studying cancer evolutionary history, but it is typically computationally expensive in both running time and memory consumption because the search space of tree structures is huge. Subclonality characterization of single cells, by contrast, can be cast as a cell clustering problem, which has a much lower dimension and a much shorter turnaround time. Although a few state-of-the-art cell clustering tools exist for scDNAseq, a comprehensive and objective comparison of them under different settings is lacking.
Results: In this paper, we benchmarked three state-of-the-art cell clustering tools (SCG, BnpC and SCClone) on simulated datasets generated under a variety of parameter settings and on a real dataset. We designed a simulator specifically for cell clustering and compared the three methods in terms of clustering accuracy, genotyping accuracy and running time.
Conclusion: From the benchmark study, we conclude that BnpC has the highest clustering accuracy of the three methods. SCG's accuracy is very close to BnpC's, and SCG is much faster than the other two methods, especially when the number of cells is within 1000. When the number of single cells is large (> 1500), BnpC is highly recommended because it scales well without sacrificing clustering accuracy. Of the three methods, SCClone estimates the number of clusters most accurately, especially when the underlying number of clusters is high. When the variance of the cluster sizes is high, the clustering accuracy of all three methods drops. Improving clustering accuracy when the variance of the cluster sizes is high is a potential direction for future work in scDNAseq cell clustering.
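As an illustration of how the clustering accuracy compared above might be quantified, the following is a minimal sketch that scores a predicted cell-to-cluster assignment against the ground-truth assignment from a simulation. The abstract does not state which metric the benchmark uses; the Adjusted Rand Index (ARI) is assumed here only as one common choice, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand Index between two cluster assignments of the same cells.

    Assumption: ARI stands in for "clustering accuracy"; the benchmark
    itself may use a different metric.
    """
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")

    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # zero or one cell: the partitions agree trivially

    # Number of cell pairs placed together in both, in truth, and in prediction.
    both = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    in_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    in_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())

    expected = in_true * in_pred / total_pairs  # agreement expected by chance
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial and identical
    return (both - expected) / (maximum - expected)


# Toy example (hypothetical labels): cluster names differ but the grouping matches.
truth = [0, 0, 1, 1]
prediction = [1, 1, 0, 0]
score = adjusted_rand_index(truth, prediction)
```

ARI ignores cluster names, so a tool that recovers the correct grouping under different labels still scores 1, while a grouping no better than chance scores about 0.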