Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets

Washburne, Alex D.; Silverman, Justin D.; Leff, Jonathan W.; Bennett, Dominic J.; Darcy, John L.; Mukherjee, Sayan; Fierer, Noah; David, Lawrence A.

doi:10.7717/peerj.2969

Cited by 110 publications

(159 citation statements)

References 42 publications

Supporting

Mentioning

154

Contrasting

“…Hence, even when using normalized samples, the pitfalls regarding correlations of compositional data [59] do not apply here. The method further bears some conceptual similarity to Phylofactorization [31], for which we later present an adaptation to phylogenetic placements, called Placement-Factorization. Phylofactorization also takes meta-data features into account and can thereby identify relationships between changes in environmental variables and changes in abundances in clades of the tree.…”

mentioning

confidence: 99%

“…The concepts and methods presented above resemble two recent approaches for analyzing phylogenetic data: the Phylogenetic Isometric Log-Ratio (PhILR) transformation and balances [30], as well as Phylogenetic Factorization (Phylofactorization) [31]. These methods use a tree inferred from the OTU sequences of the samples (instead of a fixed reference tree), and annotate the abundances of OTUs per sample on the tips of this tree (instead of placement masses on the branches).…”

mentioning

confidence: 99%

“…The main adaptation step consists in placing masses on the branches of our (fixed) reference tree, instead of only considering masses (abundances) at the tips of the OTU tree. Here, we focus on balances that contrast the subtrees induced by edges of the tree, as used by Phylofactorization [31], because this is more natural in the context of phylogenetic placement data. The same concepts could however also be employed for subtrees below nodes, as used by the PhILR transform [30].…”

mentioning

confidence: 99%

“…In the original PhILR, balances are calculated for the two subtrees below a given node of the tree [30]. In the context of Phylofactorization, this has been generalized to balances between any two disjoint sets R and S of taxa (tips of the tree) [31]. We here build on the latter, but again change R and S to refer to disjoint sets of edges of our reference tree.…”

mentioning

confidence: 99%
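
To make the notion of a balance between two disjoint sets R and S concrete, the following is a minimal sketch using the standard isometric log-ratio balance: the difference of the mean log-abundances of the two groups, scaled by sqrt(|R||S| / (|R| + |S|)). The function name `balance`, the dictionary input format, and the toy abundances are assumptions for illustration; this is not code from the phylofactor package or from the citing preprint.

```python
import math

def balance(x, R, S):
    """ILR balance contrasting the taxa in R against the taxa in S.

    x maps taxon name -> strictly positive abundance (hypothetical input format).
    """
    R, S = list(R), list(S)
    if not R or not S or set(R) & set(S):
        raise ValueError("R and S must be non-empty and disjoint")
    r, s = len(R), len(S)
    # Mean of logs equals the log of the geometric mean of each group.
    mean_log_R = sum(math.log(x[t]) for t in R) / r
    mean_log_S = sum(math.log(x[t]) for t in S) / s
    return math.sqrt(r * s / (r + s)) * (mean_log_R - mean_log_S)

# Toy sample with made-up abundances for four taxa.
sample = {"A": 10.0, "B": 20.0, "C": 5.0, "D": 10.0}
b = balance(sample, ["A", "B"], ["C", "D"])

# Rescaling every abundance by the same factor leaves the balance unchanged,
# which is why balances are suited to compositional (relative) data.
scaled = {taxon: 3.0 * value for taxon, value in sample.items()}
b_scaled = balance(scaled, ["A", "B"], ["C", "D"])
```

Swapping R and S flips the sign of the balance, so the orientation of the contrast is a convention rather than a property of the data.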

“…Phylofactorization is a method to identify edges in a phylogenetic tree that drive 596 patterns in the composition of microbial communities [31]. An edge constitutes a 597 separation or split of groups of taxa into the two subtrees induced by the edge.…”

mentioning

confidence: 99%
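
The idea that each edge splits the taxa into two groups can be sketched as follows. The example assumes a toy tree written as nested tuples with strings as tips; the helper names `tips` and `edge_splits` are hypothetical, and the step in which Phylofactorization scores the candidate edges and selects one is not shown.

```python
def tips(node):
    """Return the tip names under a node (a tip is a str, an internal node a tuple)."""
    if isinstance(node, str):
        return [node]
    return [t for child in node for t in tips(child)]

def edge_splits(tree):
    """Yield (R, S) per edge: R holds the tips below the edge, S all other tips."""
    all_tips = frozenset(tips(tree))
    stack = [] if isinstance(tree, str) else list(tree)
    seen = set()
    while stack:
        node = stack.pop()
        R = frozenset(tips(node))
        S = all_tips - R
        key = frozenset((R, S))
        # The two edges at a bifurcating root induce the same split; report it once.
        if S and key not in seen:
            seen.add(key)
            yield R, S
        if not isinstance(node, str):
            stack.extend(node)

# Toy four-tip tree: ((A, B), (C, D)).
toy_tree = (("A", "B"), ("C", "D"))
splits = list(edge_splits(toy_tree))
```

Each resulting pair (R, S) is a candidate contrast, for example as the two groups of a balance.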

Scalable methods for analyzing and visualizing phylogenetic placement of metagenomic samples

Czech

Stamatakis

2018

Preprint

The exponential decrease in molecular sequencing cost generates unprecedented amounts of data. Hence, scalable methods to analyze these data are required. Phylogenetic (or Evolutionary) Placement methods identify the evolutionary provenance of anonymous sequences with respect to a given reference phylogeny. This increasingly popular method is deployed for scrutinizing metagenomic samples from environments such as water, soil, or the human gut.

Here, we present novel and, more importantly, highly scalable methods for analyzing phylogenetic placements of metagenomic samples. More specifically, we introduce methods for (a) visualizing differences between samples and their correlation with associated meta-data on the reference phylogeny, (b) clustering similar samples using a variant of the k-means method, and (c) finding phylogenetic factors using an adaptation of the Phylofactorization method. These methods enable to interpret metagenomic data in a phylogenetic context, to find patterns in the data, and to identify branches of the phylogeny that are driving these patterns.

To demonstrate the scalability and utility of our methods, as well as to provide exemplary interpretations of our methods, we applied them to 3 publicly available datasets comprising 9782 samples with a total of approximately 168 million sequences. The results indicate that new biological insights can be attained via our methods.

Introduction

The availability of high-throughput DNA sequencing technologies has revolutionized biology by transforming it into an ever more data-driven and compute-intense discipline [1]. In particular, Next Generation Sequencing (NGS) [2], as well as later generations [3][4][5][6], have given rise to novel methods for studying microbial environments [7][8][9][10]. These technologies are often used in metagenomic studies to sequence organisms in water [11][12][13] or soil [14,15] samples, in the human microbiome [16][17][18], and a plethora of other environments. These studies yield a large set of short anonymous DNA sequences, so-called reads, for each sample. Reads that are obtained from specific parts of the genome are called meta-barcoding reads; most often, reads are amplified before sequencing and later de-replicated again, resulting in so-called amplicons. A typical task in metagenomic studies is to identify and classify these sequences with respect to known reference sequences, either in a taxonomic or a phylogenetic context.

Conventional methods like BLAST [19] are based on sequence similarity or identity. Such methods are fast, but only attain satisfying accuracy levels if the query sequences (e.g., the environmental reads or amplicons) are sufficiently similar to the reference sequences. Furthermore, BLAST might yield suboptimal results [20], and the best BLAST hit does often not represent the most closely related species [21].

Alternatively, so-called phylogenetic (or evolutionary) placement methods [22][23][24] identify query sequences based on a...

Phylofactorization: a graph partitioning algorithm to identify phylogenetic scales of ecological data

Washburne

Silverman

Morton

et al. 2019

Ecological Monographs

Self Cite

102

The problem of pattern and scale is a central challenge in ecology. In community ecology, an important scale is that at which we aggregate species to define our units of study, such as aggregation of “nitrogen fixing trees” to understand patterns in carbon sequestration. With the emergence of massive community ecological data sets, there is a need to objectively identify the scales for aggregating species to capture well‐defined patterns in community ecological data. The phylogeny is a scaffold for identifying scales of species‐aggregation associated with macroscopic patterns. Phylofactorization was developed to identify phylogenetic scales underlying patterns in relative abundance data, but many ecological data, such as presence‐absences and counts, are not relative abundances yet may still have phylogenetic scales capturing patterns of interest. Here, we broaden phylofactorization to a graph‐partitioning algorithm identifying phylogenetic scales in community ecological data. As a graph‐partitioning algorithm, phylofactorization connects many tools from data analysis to phylogenetically informed analyses of community ecological data. Two‐sample tests identify five phylogenetic factors of mammalian body mass which arose during the K‐Pg extinction event, consistent with other analyses of mammalian body mass evolution. Projection of data onto coordinates connecting the phylogeny and graph‐partitioning algorithm yield a phylogenetic principal components analysis which refines our understanding of the major sources of variation in the human gut microbiome. These same coordinates allow generalized additive modeling of microbes in Central Park soils, confirming that a large clade of Acidobacteria thrive in neutral soils. The graph‐partitioning algorithm extends to generalized linear and additive modeling of exponential family random variables by phylogenetically constrained reduced‐rank regression or stepwise factor contrasts. All of these tools can be implemented with the R package phylofactor.

Rare microbial taxa emerge when communities collide: freshwater and marine microbiome responses to experimental mixing

Rocca

Simonin

Bernhardt

et al. 2020

Ecology

Self Cite

Whole microbial communities regularly merge with one another, often in tandem with their environments, in a process called community coalescence. Such events impose substantial changes: abiotic perturbation from environmental blending and biotic perturbation of community merging. We used an aquatic mixing experiment to unravel the effects of these perturbations on the whole microbiome response and on the success of individual taxa when distinct freshwater and marine communities coalesce. We found that an equal mix of freshwater and marine habitats and blended microbiomes resulted in strong convergence of the community structure toward that of the marine microbiome. The enzymatic potential of these blended microbiomes in mixed media also converged toward that of the marine, with strong correlations between the multivariate response patterns of the enzymes and of community structure. Exposing each endmember inocula to an axenic equal mix of their freshwater and marine source waters led to a 96% loss of taxa from our freshwater microbiomes and a 66% loss from our marine microbiomes. When both inocula were added together to this mixed environment, interactions amongst the communities led to a further loss of 29% and 49% of freshwater and marine taxa, respectively. Under both the axenic and competitive scenarios, the diversity lost was somewhat counterbalanced by increased abundance of microbial taxa that were too rare to detect in the initial inocula. Our study emphasizes the importance of the rare biosphere as a critical component of microbial community responses to community coalescence.
