A Randomized Algorithm for Comparing Sets of Phylogenetic Trees

Sul, Seung-Jin; Williams, Tiffani L.

doi:10.1142/9781860947995_0015

Cited by 14 publications

(10 citation statements)

References 13 publications


“…However, the question arises how to select a hash function for the hash key, which in our case is simply the bipartition vector. The usage of universal hash functions (Carter and Wegman, 1977) as advocated in some more theoretical papers (Sul and Williams, 2007; Sul et al., 2008; Amenta et al., 2003) is highly questionable: firstly, because the computation of a universal hash function given a bit vector of length n is slow, and secondly, universal hash functions only work well when hash keys are equally randomly distributed (Carter and Wegman, 1977), which is not very likely for hash keys that are induced by a hierarchical data structure such as a tree. Those two practical performance considerations have not been addressed in the aforementioned articles.…”

Section: Application Of Bipartition Hashing (mentioning)

confidence: 99%
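The statement above concerns hashing a bipartition bit vector with a universal hash function. As a rough illustration only, the sketch below encodes a bipartition as an integer bit mask, normalizes it so both sides of the same split map to one key, and applies a Carter-Wegman style function h(x) = ((a*x + b) mod p) mod m. The names `normalize`, `make_universal_hash`, and `P` are hypothetical and chosen for this example; this is not the code of any of the cited papers or of RAxML. Keys are reduced modulo `P` before hashing, which is a simplification for vectors longer than 61 bits.

```python
import random

# A Mersenne prime used as the modulus of the hash family (assumption for this sketch).
P = (1 << 61) - 1


def normalize(mask: int, n_taxa: int) -> int:
    """Return the canonical side of a bipartition: the side that excludes taxon 0."""
    full = (1 << n_taxa) - 1
    # If bit 0 is set, flip to the complementary side so both encodings agree.
    return mask ^ full if mask & 1 else mask


def make_universal_hash(m: int, rng: random.Random):
    """Draw one function h(x) = ((a*x + b) mod P) mod m from the family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)

    def h(key: int) -> int:
        # Reduce long bit vectors modulo P first (simplification, see lead-in).
        return ((a * (key % P) + b) % P) % m

    return h


# Usage: the split {t0, t1} | {t2, t3, t4} written from either side.
h = make_universal_hash(m=1024, rng=random.Random(0))
side_a = 0b00011
side_b = 0b11100
slot_a = h(normalize(side_a, 5))
slot_b = h(normalize(side_b, 5))
```

Both encodings of the split land in the same slot because normalization happens before hashing. The multiplication on a long bit vector is the cost the quoted statement refers to as slow.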

How Many Bootstrap Replicates Are Necessary?

Pattengale

Alipour

Bininda‐Emonds

et al. 2010

Journal of Computational Biology

739

385


Phylogenetic bootstrapping (BS) is a standard technique for inferring confidence values on phylogenetic trees that is based on reconstructing many trees from minor variations of the input data, trees called replicates. BS is used with all phylogenetic reconstruction approaches, but we focus here on one of the most popular, maximum likelihood (ML). Because ML inference is so computationally demanding, it has proved too expensive to date to assess the impact of the number of replicates used in BS on the relative accuracy of the support values. For the same reason, a rather small number (typically 100) of BS replicates are computed in real-world studies. Stamatakis et al. recently introduced a BS algorithm that is 1 to 2 orders of magnitude faster than previous techniques, while yielding qualitatively comparable support values, making an experimental study possible. In this article, we propose stopping criteria--that is, thresholds computed at runtime to determine when enough replicates have been generated--and we report on the first large-scale experimental study to assess the effect of the number of replicates on the quality of support values, including the performance of our proposed criteria. We run our tests on 17 diverse real-world DNA--single-gene as well as multi-gene--datasets, which include 125-2,554 taxa. We find that our stopping criteria typically stop computations after 100-500 replicates (although the most conservative criterion may continue for several thousand replicates) while producing support values that correlate at better than 99.5% with the reference values on the best ML trees. Significantly, we also find that the stopping criteria can recommend very different numbers of replicates for different datasets of comparable sizes. 
Our results are thus twofold: (i) they give the first experimental assessment of the effect of the number of BS replicates on the quality of support values returned through BS, and (ii) they validate our proposals for stopping criteria. Practitioners will no longer have to enter a guess nor worry about the quality of support values; moreover, with most counts of replicates in the 100-500 range, robust BS under ML inference becomes computationally practical for most datasets. The complete test suite is available at http://lcbb.epfl.ch/BS.tar.bz2, and BS with our stopping criteria is included in the latest release of RAxML v7.2.5, available at http://wwwkramer.in.tum.de/exelixis/software.html.


“…It is this distance computation that we parallelize in our case study. The distance metric itself is called Robinson-Foulds (RF) distance, and the fastest algorithm for all-to-all RF distance computation is the HashRF algorithm [19], introduced by a software package of the same name. HashRF is about 2-3× as fast as PhyBin.…”

Section: Case Study: PhyBin: All-to-all Tree Edit Distance (mentioning)

confidence: 99%
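For readers unfamiliar with the Robinson-Foulds distance named in the statement above, here is a minimal sketch. It assumes trees are given as nested Python tuples of taxon names and counts the non-trivial bipartitions present in exactly one of the two trees (some definitions halve this count). The helper names `bipartitions` and `rf_distance` are hypothetical; this is not the HashRF or PhyBin implementation.

```python
def bipartitions(tree):
    """Collect the non-trivial bipartitions of a tree given as nested tuples."""

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_taxa = leaves(tree)
    ref = min(all_taxa)  # reference taxon used to pick a canonical side
    result = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits: single leaves, all taxa, or all taxa but one.
        if 1 < len(clade) < len(all_taxa) - 1:
            # Store the side that does not contain the reference taxon.
            result.add(all_taxa - clade if ref in clade else clade)
        return clade

    visit(tree)
    return result


def rf_distance(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Usage: two five-taxon trees that agree on {D, E} but disagree on the other split.
t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
d = rf_distance(t1, t2)
```

Here `d` is 2: each tree contributes one bipartition that the other lacks. Computing this pairwise for every pair of trees is the repeated comparison that all-to-all methods try to avoid.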

Taming the parallel effect zoo

et al. 2014


A fundamental challenge of parallel programming is to ensure that the observable outcome of a program remains deterministic in spite of parallel execution. Language-level enforcement of determinism is possible, but existing deterministic-by-construction parallel programming models tend to lack features that would make them applicable to a broad range of problems. Moreover, they lack extensibility: it is difficult to add or change language features without breaking the determinism guarantee.
The recently proposed LVars programming model, and the accompanying LVish Haskell library, took a step toward broadly applicable guaranteed-deterministic parallel programming. The LVars model allows communication through shared monotonic data structures to which information can only be added, never removed, and for which the order in which information is added is not observable. LVish provides a Par monad for parallel computation that encapsulates determinism-preserving effects while allowing a more flexible form of communication between parallel tasks than previous guaranteed-deterministic models provided.
While applying LVar-based programming to real problems using LVish, we have identified and implemented three capabilities that extend its reach: inflationary updates other than least-upper-bound writes; transitive task cancellation; and parallel mutation of non-overlapping memory locations. The unifying abstraction we use to add these capabilities to LVish (without suffering added complexity or cost in the core LVish implementation, or compromising determinism) is a form of monad transformer, extended to handle the Par monad. With our extensions, LVish provides the most broadly applicable guaranteed-deterministic parallel programming interface available to date. We demonstrate the viability of our approach both with traditional parallel benchmarks and with results from a real-world case study: a bioinformatics application that we parallelized using our extended version of LVish.
1 We refer here to external determinism, also called determinacy. Of course, many parallel applications depend critically on observably nondeterministic behavior (for example, hardware designs and GUIs). These are not candidates for deterministic execution, but that still leaves many that are.


“…PhyBin reimplements the HashRF algorithm for full all-to-all Robinson Foulds distance (Sul & Williams, 2007), which is significantly faster than computing the distance matrix with repeated comparison of individual trees (e.g., PAUP (Swofford & Sullivan, 2003)). The HashRF algorithm is fast for today’s data sizes (e.g., hundreds of taxa and thousands of trees), but it scales much more poorly than the basic binning algorithm at significantly larger sizes.…”

Section: Description Of the Program (mentioning)

confidence: 99%
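The contrast drawn above, between one all-to-all pass and repeated comparison of individual trees, can be sketched as follows. The idea shown is to file every bipartition in a table keyed by the bipartition, then count shared bipartitions per tree pair from the table. This is a simplified illustration that uses Python's built-in dictionary with exact keys; it does not reproduce the hash functions or collision handling of HashRF, and the name `all_pairs_rf` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_rf(tree_bips):
    """All-to-all RF matrix from a list of per-tree bipartition sets."""
    n = len(tree_bips)

    # Table: bipartition -> indices of the trees that contain it.
    table = defaultdict(list)
    for idx, bips in enumerate(tree_bips):
        for bip in bips:
            table[bip].append(idx)

    # Every pair of trees listed under the same key shares that bipartition.
    shared = [[0] * n for _ in range(n)]
    for owners in table.values():
        for i, j in combinations(owners, 2):
            shared[i][j] += 1
            shared[j][i] += 1

    # RF(i, j) = |B_i| + |B_j| - 2 * (number of shared bipartitions).
    return [
        [
            0 if i == j
            else len(tree_bips[i]) + len(tree_bips[j]) - 2 * shared[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Usage: three trees described directly by their bipartition sets.
x = frozenset({"D", "E"})
y = frozenset({"C", "D", "E"})
z = frozenset({"B", "D", "E"})
matrix = all_pairs_rf([{x, y}, {x, z}, {x, y}])
```

The pair-counting loop grows with the square of the number of trees that share a bipartition, which is one plausible reading of why such an approach scales worse than simple binning when the tree collection becomes very large.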

PhyBin: binning trees by topology

Newton

2013

PeerJ


A major goal of many evolutionary analyses is to determine the true evolutionary history of an organism. Molecular methods that rely on the phylogenetic signal generated by a few to a handful of loci can be used to approximate the evolution of the entire organism but fall short of providing a global, genome-wide perspective on evolutionary processes. Indeed, individual genes in a genome may have different evolutionary histories. Therefore, it is informative to analyze the number and kind of phylogenetic topologies found within an orthologous set of genes across a genome. Here we present PhyBin: a flexible program for clustering gene trees based on topological structure. PhyBin can generate bins of topologies corresponding to exactly identical trees or can utilize Robinson-Foulds distance matrices to generate clusters of similar trees, using a user-defined threshold. Additionally, PhyBin allows the user to adjust for potential noise in the dataset (as may be produced when comparing very closely related organisms) by pre-processing trees to collapse very short branches or those nodes not meeting a defined bootstrap threshold. As a test case, we generated individual trees based on an orthologous gene set from 10 Wolbachia species across four different supergroups (A–D) and utilized PhyBin to categorize the complete set of topologies produced from this dataset. Using this approach, we were able to show that although a single topology generally dominated the analysis, confirming the separation of the supergroups, many genes supported alternative evolutionary histories. Because PhyBin’s output provides the user with lists of gene trees in each topological cluster, it can be used to explore potential reasons for discrepancies between phylogenies including homoplasies, long-branch attraction, or horizontal gene transfer events.

