Background. Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, the evolutionary trees or phylogenies that represent an accumulation of cancerous mutations. A drawback of SCS is its elevated false negative and missing value rates, which result in a large space of possible solutions and in turn make some approaches and tools infeasible. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in the size and resolution of these data is beginning to put a strain on such methods. One possible solution is to reduce the size of an SCS instance (usually represented as a matrix of presence, absence and missing values of the mutations found in the different sequenced cells) and infer the tree from the reduced-size instance. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells, and using the resulting means as the reduced instance.
Results. In this work, we benchmark a variety of commonly used methods for clustering vector or matrix data, here representing SCS instances, with a focus on how effective these methods are for the purpose of inferring tumor evolutionary trees from these SCS instances. A trend we observe is that the methods designed for clustering data of a categorical nature (SCS data having three categories: present, absent and missing), namely k-modes and a method we have devised called celluloid (see Methods), perform much better than all of the other methods. We demonstrate that these categorical methods cluster mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and that they also yield accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by such a method.
Finally, we demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in runtime, which considerably raises the upper bound on the size of SCS instances that can be solved in practice.
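To make the size-reduction idea concrete, the following is a minimal sketch of a plain k-modes clustering of mutations, assuming an SCS matrix with cells as rows and mutations as columns, encoded as 0 (absent), 1 (present) and 2 (missing). The function names, the encoding, the Hamming dissimilarity and the deterministic initialization are illustrative assumptions; this is not the celluloid method nor any specific implementation benchmarked in this work.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions (cells) where two mutation profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def positionwise_mode(profiles):
    """Most frequent category at each position across a list of profiles."""
    return tuple(Counter(vals).most_common(1)[0][0] for vals in zip(*profiles))


def k_modes(profiles, k, n_iter=20):
    """Plain k-modes on categorical tuples (assumed, simplified variant).

    Modes are initialized with the first k distinct profiles, so the run is
    deterministic; real implementations usually use smarter initialization.
    """
    modes = list(dict.fromkeys(profiles))[:k]
    k = len(modes)  # fewer clusters if there are fewer than k distinct profiles
    labels = None
    for _ in range(n_iter):
        # Assign each profile to its nearest mode.
        new_labels = [
            min(range(k), key=lambda c: hamming(p, modes[c])) for p in profiles
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute the mode of each non-empty cluster.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                modes[c] = positionwise_mode(members)
    return labels, modes


def reduce_instance(matrix, k):
    """Cluster the mutation columns and return the reduced matrix and labels.

    The reduced matrix keeps one column per cluster: the cluster's mode.
    """
    columns = [tuple(col) for col in zip(*matrix)]
    labels, modes = k_modes(columns, k)
    reduced = [list(row) for row in zip(*modes)]
    return reduced, labels


# Hypothetical 4-cell, 4-mutation instance (2 marks a missing value).
example = [
    [1, 1, 0, 0],
    [1, 1, 0, 2],
    [0, 2, 1, 1],
    [0, 0, 1, 1],
]
reduced, labels = reduce_instance(example, k=2)
```

The reduced matrix has the same number of cells but only k mutation columns, and it is this smaller instance that would be passed to a phylogeny inference method.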
Background
At the time of detection, usually through a biopsy of the extracted tissue, a tumor is the result of a tumultuous evolutionary process. It originates from a single tumor cell, the founder cell [23], that has acquired a driver mutation, which inhibits control over the proliferation of subsequent cancer cells. From that moment, unrestrained proliferation combines with a very hostile environment: as the immune system fights for survival, the tumor cells, under extreme selection pressure, have to disguise themselves to avoid being attacked and compete with each other, all while thriving on low levels of oxygen. This combination produces the accumulation of a highly elevated number of mutations, including structural variations. This model of tumor evolution is called the clonal model [23], since a clone is a population of cells carrying the same set of mutations. Understanding this clon...