2022
DOI: 10.1093/molbev/msac254

From Easy to Hopeless—Predicting the Difficulty of Phylogenetic Analyses

Abstract: Phylogenetic analyses under the Maximum Likelihood model are time and resource intensive. To adequately capture the vastness of tree space, one needs to infer multiple independent trees. On some datasets, multiple tree inferences converge to similar tree topologies, on others to multiple, topologically highly distinct yet statistically indistinguishable topologies. At present, no method exists to quantify and predict this behavior. We introduce a method to quantify the degree of difficulty for analyzing a data…

Citations: cited by 30 publications (25 citation statements)
References: 29 publications
“…For both GBT and CNN classifiers, we observed a general trend for lower classification accuracy on more difficult MSAs according to the Pythia difficulty score. The higher the Pythia difficulty for an MSA, the lower the signal in the data and the more difficult it is to obtain a well-supported phylogeny as the likelihood surface exhibits multiple indistinguishable (by means of standard phylogenetic significance tests) likelihood peaks [21]. In addition to assessing the BACC as a function of the difficulty of simulated MSAs, we also assessed the BACC as a function of the difficulty of the underlying empirical MSAs.…”
Section: Results (mentioning)
confidence: 99%
“…For data collections simulating indel events, we also used the proportion of gaps as feature (% gaps). Further, we quantified the signal in the MSA using the difficulty of the respective phylogenetic analysis as predicted by Pythia [21] (difficulty), as well as the Shannon entropy [44] of the MSA (Entropy), a multinomial test statistic of the MSA (Bollback multinomial; [8]), and an entropy-like metric based on the number and frequency of patterns in the MSA (Pattern entropy). For further details on the computation of these metrics, we refer the interested reader to Supplementary Material Section 4.1.…”
Section: Methods (mentioning)
confidence: 99%
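The statement above names two simple alignment features, the proportion of gaps and the Shannon entropy of the MSA. The citing paper defers its exact definitions to its Supplementary Material, which is not reproduced here, so the following is only a minimal sketch under stated assumptions: the MSA is a list of equal-length strings, `-` is the only gap character, and "entropy of the MSA" is taken to be the mean per-column Shannon entropy in bits. The function names are hypothetical.

```python
import math
from collections import Counter


def gap_proportion(msa, gap_chars="-"):
    """Fraction of all alignment cells that are gap characters (assumed: '-')."""
    total = sum(len(seq) for seq in msa)
    gaps = sum(seq.count(c) for seq in msa for c in gap_chars)
    return gaps / total if total else 0.0


def column_entropy(column, gap_chars="-"):
    """Shannon entropy (bits) of the non-gap states in one alignment column."""
    states = [c for c in column if c not in gap_chars]
    if not states:
        return 0.0
    counts = Counter(states)
    n = len(states)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def mean_column_entropy(msa):
    """Average per-column entropy; one possible reading of 'entropy of the MSA'."""
    columns = list(zip(*msa))  # transpose rows (sequences) into columns (sites)
    if not columns:
        return 0.0
    return sum(column_entropy(col) for col in columns) / len(columns)
```

Other conventions (for example, counting gaps as a state, or using the natural logarithm) would give different values, so results from this sketch should not be compared directly with published numbers.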
“…We noticed that approximately half of the MSAs in TreeBASE contain at least two exactly identical sequences, and therefore decided to remove all duplicate sequences before training EBG. We selected the MSAs based on the Pythia difficulty [12]. Pythia quantifies the difficulty of phylogenetic analysis under the ML criterion.…”
Section: Training Data (mentioning)
confidence: 99%
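The statement above mentions removing exactly identical sequences from each MSA before training. How the citing authors implemented this is not stated here; the sketch below is one straightforward way to do it, assuming the alignment is held as `(name, sequence)` pairs and that the first occurrence of each sequence is the one to keep. The function name is hypothetical.

```python
def drop_duplicate_sequences(records):
    """Keep the first record for each distinct sequence string.

    `records` is an iterable of (name, sequence) pairs; order is preserved.
    Only exact string matches count as duplicates (assumption).
    """
    seen = set()
    unique = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((name, seq))
    return unique


records = [("t1", "ACGT"), ("t2", "ACGA"), ("t3", "ACGT")]
print(drop_duplicate_sequences(records))
```

In this usage example `t3` is dropped because its sequence is identical to that of `t1`.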
“…RAxML-NG infers parsimony starting trees via a randomized stepwise addition order algorithm (−−start-option). The development of Pythia showed that by using the computationally substantially less expensive parsimony trees, we can accurately predict features of the ML tree space [12]. Due to the high prediction accuracy of Pythia, we therefore expect that parsimony-based features will also be useful for predicting SBS values.…”
Section: Feature Engineering (mentioning)
confidence: 99%
“…The main disadvantage is that an additional step is required in the analysis (either an MCMC search or the inference of a bootstrap distribution), and accurately computing these distributions can be challenging, especially on large numbers of taxa where the number of tips in the single gene trees can be large compared to the length of the multiple sequence alignment. The degree of difficulty for a phylogenetic inference on a given MSA or essentially lack of signal for obtaining a stable single gene (family) tree can now be predicted via machine learning methods (Haag et al 2022). However, because the approximation needs only to be computed once for each gene family, ALE is faster than GeneRax for the purpose of evaluating different rooted species trees (or root positions on a single unrooted topology).…”
Section: Inference Of Gene Duplication Transfer and Loss Events Using... (mentioning)
confidence: 99%