2022
DOI: 10.1371/journal.pcbi.1010056

phastSim: Efficient simulation of sequence evolution for pandemic-scale datasets

Abstract: Sequence simulators are fundamental tools in bioinformatics, as they allow us to test data processing and inference tools, and are an essential component of some inference methods. The ongoing surge in available sequence data is however testing the limits of our bioinformatics software. One example is the large number of SARS-CoV-2 genomes available, which are beyond the processing power of many methods, and simulating such large datasets is also proving difficult. Here, we present a new algorithm and software…

citations
Cited by 20 publications
(27 citation statements)
references
References 40 publications
“… Turakhia et al (2021) propose the Mutation Annotated Tree (MAT) format (consisting of a Newick tree and associated mutations in a binary format) and the program as an efficient way to store and process large viral datasets (McBroome et al 2021), achieving excellent compression and processing performance. Similarly, (De Maio et al 2021) was developed to simulate sequence evolution on such large SARS-CoV-2 phylogenies, and also outputs a Newick tree annotated with mutations (not in MAT format) to avoid the bottleneck of generating and storing the simulated sequences. While these methods illustrate the advantages of the general approach of storing ancestry and mutations rather than sequences, they do not generalize beyond their immediate settings, and no software library support is available.…”
Section: Results; citation type: mentioning
confidence: 99%
“…Simulating an alignment based on a known tree ensures that there is a ground truth for comparison to definitively assess each optimization method. We used an inferred global phylogeny as a template to simulate a complete multiple sequence alignment using phastSim (De Maio et al 2021b). We subsampled this simulated alignment into 50 progressively larger sets of samples, ranging in number of samples from 4,676 to 233,326 (see Methods), to examine each of the three optimization methods in both online and de novo phylogenetics.…”
Section: Results; citation type: mentioning
confidence: 99%
“…(De Maio et al 2021a), with position-specific mean mutation rates sampled from a gamma distribution with alpha=beta=4, and with 1% of the genome having a 10-fold increase mutation rate for one specific mutation type (SARS-CoV-2 hypermutability model described in ref. (De Maio et al 2021b)). Evolution of coding regions was simulated with the same neutral mutational distribution, with a mean nonsynonymous/synonymous rate ratio of omega=0.48 as estimated in (Turakhia et al 2021a), with codon-specific omega values sampled from a gamma distribution with alpha=0.96 and beta=2.…”
Section: Methods; citation type: mentioning
confidence: 99%
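
The parameter choices quoted in the Methods statement above can be illustrated with a minimal sketch. This is not phastSim's code or API: the function names, the list of twelve mutation types, and the seeds are hypothetical, chosen only for illustration. It assumes the quoted beta is a rate parameter, which is the reading under which alpha=0.96 and beta=2 give the stated mean omega of 0.96/2 = 0.48, and alpha=beta=4 gives a mean site rate of 1.

```python
import random

# Hypothetical labels for the 12 possible nucleotide substitutions.
MUTATION_TYPES = [(a, b) for a in "ACGT" for b in "ACGT" if a != b]


def sample_site_model(n_sites, alpha=4.0, beta=4.0,
                      hyper_fraction=0.01, hyper_fold=10.0, seed=0):
    """Draw a mean mutation rate per site, and mark ~1% of sites as
    hypermutable for one randomly chosen mutation type (assumed reading
    of the quoted model, not phastSim's implementation)."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n_sites):
        # random.gammavariate takes (shape, scale); beta is treated as a
        # rate here, so the scale is 1 / beta and the mean is alpha / beta.
        mean_rate = rng.gammavariate(alpha, 1.0 / beta)
        boosted = None
        if rng.random() < hyper_fraction:
            # One specific mutation type gets its rate multiplied by hyper_fold.
            boosted = (rng.choice(MUTATION_TYPES), hyper_fold)
        sites.append((mean_rate, boosted))
    return sites


def sample_codon_omegas(n_codons, alpha=0.96, beta=2.0, seed=1):
    """Draw a nonsynonymous/synonymous rate ratio (omega) per codon."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / beta) for _ in range(n_codons)]


if __name__ == "__main__":
    sites = sample_site_model(20000)
    omegas = sample_codon_omegas(20000)
    print("mean site rate:", sum(r for r, _ in sites) / len(sites))
    print("hypermutable fraction:",
          sum(1 for _, b in sites if b is not None) / len(sites))
    print("mean omega:", sum(omegas) / len(omegas))
```

With a large number of sites, the sample means should land close to 1 for the site rates and close to 0.48 for omega, which is a quick way to check that the rate parameterization matches the quoted values.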
“…Existing simulators often require long runtimes and a lot of memory to generate MSAs with millions of sequences or sites. The only exception to this is the recently-introduced phastSim (De Maio et al 2022), designed to simulate alignments of hundreds of thousands of genomes from viruses such as SARS-CoV-2.…”
Section: Introduction; citation type: mentioning
confidence: 99%