The exponential decrease in molecular sequencing cost generates unprecedented amounts of data. Hence, scalable methods to analyze these data are required. Phylogenetic (or Evolutionary) Placement methods identify the evolutionary provenance of anonymous sequences with respect to a given reference phylogeny. This increasingly popular approach is used to scrutinize metagenomic samples from environments such as water, soil, or the human gut.

Here, we present novel and, more importantly, highly scalable methods for analyzing phylogenetic placements of metagenomic samples. More specifically, we introduce methods for (a) visualizing differences between samples and their correlation with associated meta-data on the reference phylogeny, (b) clustering similar samples using a variant of the k-means method, and (c) finding phylogenetic factors using an adaptation of the Phylofactorization method. These methods make it possible to interpret metagenomic data in a phylogenetic context, to find patterns in the data, and to identify branches of the phylogeny that are driving these patterns.

To demonstrate the scalability and utility of our methods, as well as to provide exemplary interpretations of their results, we applied them to three publicly available datasets comprising 9782 samples with a total of approximately 168 million sequences. The results indicate that new biological insights can be attained via our methods.
Introduction

The availability of high-throughput DNA sequencing technologies has revolutionized biology by transforming it into an ever more data-driven and compute-intensive discipline [1]. In particular, Next Generation Sequencing (NGS) [2], as well as later generations [3][4][5][6], have given rise to novel methods for studying microbial environments [7][8][9][10]. These technologies are often used in metagenomic studies to sequence organisms in water [11][12][13] or soil [14,15] samples, in the human microbiome [16][17][18], and in a plethora of other environments. These studies yield a large set of short anonymous DNA sequences, so-called reads, for each sample. Reads that are obtained from specific parts of the genome are called meta-barcoding reads; most often, reads are amplified before sequencing and later de-replicated, resulting in so-called amplicons. A typical task in metagenomic studies is to identify and classify these sequences with respect to known reference sequences, either in a taxonomic or a phylogenetic context.

Conventional methods like BLAST [19] are based on sequence similarity or identity. Such methods are fast, but only attain satisfactory accuracy levels if the query sequences (e.g., the environmental reads or amplicons) are sufficiently similar to the reference sequences. Furthermore, BLAST might yield suboptimal results [20], and the best BLAST hit often does not represent the most closely related species [21].

Alternatively, so-called phylogenetic (or evolutionary) placement methods [22][23][24] identify query sequences based on a...