Mixture Models of Nucleotide Sequence Evolution that Account for Heterogeneity in the Substitution Process Across Sites and Across Lineages

Jayaswal, Vivek; Wong, Thomas K. F.; Robinson, John; Poladian, L.; Jermiin, Lars S.

doi:10.1093/sysbio/syu036

Cited by 67 publications

(100 citation statements)

References 52 publications

“…; Jayaswal et al. ). To examine whether compositional heterogeneity could be a source of model bias in our data sets, we selected a random subset of 40 data sets that exhibited poor model performance (including the 20 data sets for which PartitionFinder suggested the use of a single partition) and performed a chi‐squared test of homogeneity using the p4 v1.0 Python package (Foster ).…”

Section: Methods

mentioning

confidence: 97%

Assessing the performance of DNA barcoding using posterior predictive simulations

Barley

Thomson

2016

Molecular Ecology

Accurate estimates of biodiversity are required for research in a broad array of biological subdisciplines including ecology, evolution, systematics, conservation and biodiversity science. The use of statistical models and genetic data, particularly DNA barcoding, has been suggested as an important tool for remedying the large gaps in our current understanding of biodiversity. However, the reliability of biodiversity estimates obtained using these approaches depends on how well the statistical models that are used describe the evolutionary process underlying the genetic data. In this study, we utilize data from the Barcode of Life Database and posterior predictive simulations to assess the performance of DNA barcoding under commonly used substitution models. We demonstrate that the success of DNA barcoding varies widely across DNA substitution models and that model choice has a substantial impact on the number of operational taxonomic units identified (changing results by ~4-31%). Additionally, we demonstrate that the widely followed practice of a priori assuming the Kimura 2-parameter model for DNA barcoding is statistically unjustified and should be avoided. Using both data-based and inference-based test statistics, we detect variation in model performance across taxonomic groups, clustering algorithms, genetic divergence thresholds and substitution models. Taken together, these results illustrate the importance of considering both model selection and model adequacy in studies quantifying biodiversity.
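The citation statement for this paper mentions a chi-squared test of compositional homogeneity. The following is a minimal sketch of that kind of test, assuming unaligned-gap-free nucleotide strings as input; it is not the p4 implementation, and the function name `composition_chi2` and the toy sequences are hypothetical. It returns only the test statistic and degrees of freedom, since the Python standard library has no chi-squared distribution function for computing a p-value.

```python
from collections import Counter


def composition_chi2(seqs, alphabet="ACGT"):
    """Return (chi-squared statistic, degrees of freedom) for a
    sequences-by-bases contingency table of nucleotide counts."""
    # Observed counts per sequence; characters outside `alphabet`
    # (gaps, ambiguity codes) are ignored.
    counts = []
    for s in seqs:
        tally = Counter(s.upper())
        counts.append([tally[b] for b in alphabet])

    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand = sum(row_totals)

    stat = 0.0
    for row, rt in zip(counts, row_totals):
        for obs, ct in zip(row, col_totals):
            expected = rt * ct / grand
            if expected > 0:  # skip bases absent from every sequence
                stat += (obs - expected) ** 2 / expected

    # Only bases that occur at least once contribute a column.
    used_cols = sum(1 for ct in col_totals if ct > 0)
    df = (len(seqs) - 1) * (used_cols - 1)
    return stat, df


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    toy = ["ACGTACGTAC", "GGCCGGCCGC", "ATATATTAAT"]
    stat, df = composition_chi2(toy)
    print(f"chi2 = {stat:.3f}, df = {df}")
```

A test of this form treats every site as an independent observation and ignores the shared ancestry of the sequences, so its result is best read as a rough screen rather than a formal assessment of compositional heterogeneity.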

“…There has been much effort in developing nonstationary, nonhomogeneous, or nonreversible models of nucleotide or amino acid substitution for use in inference of phylogenetic relationships among distant species, in both the maximum-likelihood (Yang and Roberts 1995; Galtier and Gouy 1998; Dutheil and Boussau 2008; Jayaswal et al. 2011; Groussin et al. 2013; Gueguen et al. 2013; Jayaswal et al. 2014) and Bayesian (Foster 2004; Lartillot 2006, 2008) frameworks. Here our focus is on estimation of substitution rates and counting of substitutions to study the process of sequence evolution, with the phylogeny assumed known.…”

mentioning

confidence: 99%

Evaluation of Ancestral Sequence Reconstruction Methods to Infer Nonstationary Patterns of Nucleotide Substitution

2015

Inference of gene sequences in ancestral species has been widely used to test hypotheses concerning the process of molecular sequence evolution. However, the approach may produce spurious results, mainly because using the single best reconstruction while ignoring the suboptimal ones creates systematic biases. Here we implement methods to correct for such biases and use computer simulation to evaluate their performance when the substitution process is nonstationary. The methods we evaluated include parsimony and likelihood using the single best reconstruction (SBR), averaging over reconstructions weighted by the posterior probabilities (AWP), and a new method called expected Markov counting (EMC) that produces maximum-likelihood estimates of substitution counts for any branch under a nonstationary Markov model. We simulated base composition evolution on a phylogeny for six species, with different selective pressures on G+C content among lineages, and compared the counts of nucleotide substitutions recorded during simulation with the inference by different methods. We found that large systematic biases resulted from (i) the use of parsimony or likelihood with SBR, (ii) the use of a stationary model when the substitution process is nonstationary, and (iii) the use of the Hasegawa-Kishino-Yano (HKY) model, which is too simple to adequately describe the substitution process. The nonstationary general time reversible (GTR) model, used with AWP or EMC, accurately recovered the substitution counts, even in cases of complex parameter fluctuations. We discuss model complexity and the compromise between bias and variance and suggest that the new methods may be useful for studying complex patterns of nucleotide substitution in large genomic data sets.
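The abstract above contrasts stationary and nonstationary substitution processes. As a rough illustration of what nonstationarity means for base composition, the sketch below uses the F81 model, chosen here only because its transition probabilities have a simple closed form; it is not the nonstationary GTR model evaluated in the paper, and the frequencies and the function name `f81_composition` are assumptions made for the example. Under F81 with rate matrix Q = mu (1 pi^T - I), the expected composition after time t is exp(-mu t) f0 + (1 - exp(-mu t)) pi.

```python
import math


def f81_composition(f0, pi, mu_t):
    """Expected base frequencies after evolving for mu*t under F81.

    f0   : frequencies at the start of the branch
    pi   : equilibrium frequencies of the process on the branch
    mu_t : rate multiplied by time, in the parameterization
           Q = mu * (1 pi^T - I); this is not the expected number
           of substitutions per site.
    """
    decay = math.exp(-mu_t)
    return [decay * a + (1.0 - decay) * b for a, b in zip(f0, pi)]


if __name__ == "__main__":
    # All frequencies below are invented for illustration (order A, C, G, T).
    root = [0.25, 0.25, 0.25, 0.25]
    gc_rich = [0.10, 0.40, 0.40, 0.10]  # equilibrium on one lineage
    at_rich = [0.40, 0.10, 0.10, 0.40]  # equilibrium on another lineage

    for mu_t in (0.0, 0.5, 2.0):
        gc_a = sum(f81_composition(root, gc_rich, mu_t)[1:3])
        gc_b = sum(f81_composition(root, at_rich, mu_t)[1:3])
        print(f"mu*t={mu_t:.1f}  G+C lineage A={gc_a:.3f}  lineage B={gc_b:.3f}")
```

When the starting frequencies equal the equilibrium frequencies the composition never changes, which is the stationary case; when lineages have different equilibria, their G+C content diverges over time.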

“…This is a test to establish the biological plausibility (or otherwise) of each Lie Markov model. To realize the consistency advantages of Lie Markov models requires modeling nonhomogeneous evolution which is a difficult but not insurmountable problem, for example, Jayaswal et al. (2014). We intend to take this step in a future article.…”

mentioning

confidence: 99%

A New Hierarchy of Phylogenetic Models Consistent with Heterogeneous Substitution Rates

2015

When the process underlying DNA substitutions varies across evolutionary history, some standard Markov models underlying phylogenetic methods are mathematically inconsistent. The most prominent example is the general time-reversible model (GTR) together with some, but not all, of its submodels. To rectify this deficiency, nonhomogeneous Lie Markov models have been identified as the class of models that are consistent in the face of a changing process of DNA substitutions regardless of taxon sampling. Some well-known models in popular use are within this class, but are either overly simplistic (e.g., the Kimura two-parameter model) or overly complex (the general Markov model). On a diverse set of biological data sets, we test a hierarchy of Lie Markov models spanning the full range of parameter richness. Compared against the benchmark of the ever-popular GTR model, we find that as a whole the Lie Markov models perform well, with the best performing models having 8–10 parameters and the ability to recognize the distinction between purines and pyrimidines.
