Next Generation Sequencing (NGS) technologies have led to a ubiquity of molecular sequence data. This data avalanche is particularly challenging in metagenetics, which focuses on taxonomic identification of sequences obtained from diverse microbial environments. Phylogenetic placement methods determine how these sequences fit into an evolutionary context. Previous implementations of phylogenetic placement algorithms, such as the Evolutionary Placement Algorithm (EPA) included in RAxML, or pplacer, are being increasingly used for this purpose. However, due to the steady progress in NGS technologies, the current implementations face substantial scalability limitations. Here we present EPA-ng, a complete reimplementation of the EPA that is substantially faster, offers a distributed memory parallelization, and integrates concepts from both RAxML-EPA and pplacer. EPA-ng can be executed on standard shared memory, as well as on distributed memory systems (e.g., computing clusters). To demonstrate the scalability of EPA-ng we placed 1 billion metagenetic reads from the Tara Oceans Project onto a reference tree with 3,748 taxa in just under 7 hours, using 2,048 cores. Our performance assessment shows that EPA-ng outperforms RAxML-EPA and pplacer by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems. We further show that the distributed memory parallelization of EPA-ng scales well up to 2,048 cores. EPA-ng is available under the AGPLv3 license: https://github.com/Pbdas/epa-ng.
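To put the Tara Oceans figure quoted above into perspective, the following back-of-the-envelope sketch derives the implied throughput. It uses only the numbers stated in the abstract and assumes that "just under 7 hours" can be rounded to exactly 7 hours, so the throughput values are lower bounds and the core-hour value is an upper bound. The variable names are illustrative and do not come from EPA-ng.

```python
# Figures taken from the abstract. "Just under 7 hours" is rounded up to 7 h
# (an assumption), so throughput is a lower bound and core-hours an upper bound.
reads = 1_000_000_000   # metagenetic reads placed
hours = 7.0             # wall-clock time, rounded up
cores = 2048            # cores used in the run

seconds = hours * 3600
reads_per_second = reads / seconds              # aggregate throughput
reads_per_core_second = reads_per_second / cores  # throughput per core
core_hours = hours * cores                      # total compute budget

print(f"{reads_per_second:,.0f} reads/s overall")
print(f"{reads_per_core_second:.1f} reads/s per core")
print(f"{core_hours:,.0f} core-hours")
```

Under these assumptions the run corresponds to at least roughly 39,700 placements per second overall, or about 19 placements per second per core.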
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into sub-classes using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.
Next Generation Sequencing (NGS) technologies have led to a ubiquity of molecular sequence data. This data avalanche is particularly challenging in metagenetics, which focuses on taxonomic identification of sequences obtained from diverse microbial environments. To achieve this, phylogenetic placement methods determine how these sequences fit into an evolutionary context. Previous implementations of phylogenetic placement algorithms, such as the Evolutionary Placement Algorithm (EPA) included in RAxML, or pplacer, are being increasingly used for this purpose. However, due to the steady progress in NGS technologies, the current implementations face substantial scalability limitations. Here we present EPA-ng, a complete reimplementation of the EPA that is substantially faster, offers a distributed memory parallelization, and integrates concepts from both RAxML-EPA and pplacer. EPA-ng can be executed on standard shared memory, as well as on distributed memory systems (e.g., computing clusters). To demonstrate the scalability of EPA-ng we placed 1 billion metagenetic reads from the Tara Oceans Project onto a reference tree with 3,748 taxa in just under 7 hours, using 2,048 cores. Our performance assessment shows that EPA-ng outperforms RAxML-EPA and pplacer by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems. We further show that the distributed memory parallelization of EPA-ng scales well up to 3,520 cores. EPA-ng is available under the AGPLv3 license: https://github.com/Pbdas/epa-ng (Keywords: phylogenetics; phylogenetic placement; metagenomics; metabarcoding; microbiome)
In the last decade, advances in genetic sequencing technologies have drastically reduced the price for decoding DNA and dramatically increased the amount of available DNA data. The Tara Oceans Project (Sunagawa et al. 2015), for example, has generated hundreds of billions of environmental sequences. Moreover, sequencing costs are decreasing at a significantly higher rate than computers are becoming faster according to Moore's law. Therefore, state-of-the-art bioinformatics software is facing a grand scalability challenge.
A common metagenetic data analysis step is to infer the microbiological composition of a given sample. This can be done, for instance, by determining the best hit for each query sequence (QS) in a database of reference sequences (RSs), using sequence similarity measures, and by subsequently assigning the taxonomic label of the chosen RS to the QS. However, approaches based on sequence similarity neither provide nor use phylogenetic information about the QS. This can decrease identification accuracy (Koski and Golding 2001), especially when the QSs are only distantly related to the RSs, for example when more closely related QS are simply not available.
Phylogenetic placement algorithms alleviate this problem by placing a QS onto a reference t...
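The similarity-based best-hit assignment described in the excerpt above can be sketched as follows. This is a minimal illustration and not the method of any particular tool. It assumes that query and reference sequences are already aligned to the same length, and it uses the fraction of identical positions as the similarity measure. The function names, taxon labels, and sequences are hypothetical. Note that the sketch uses no phylogenetic information, which is exactly the limitation the excerpt points out.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_hit_label(query: str, references: dict[str, str]) -> tuple[str, float]:
    """Return the label and similarity score of the most similar reference sequence."""
    label, seq = max(references.items(), key=lambda item: identity(query, item[1]))
    return label, identity(query, seq)


# Hypothetical toy data: taxonomic label -> aligned reference sequence (RS).
references = {
    "Taxon_A": "ACGTACGTAC",
    "Taxon_B": "ACGTTCGTTC",
    "Taxon_C": "TTGTACGAAC",
}
query = "ACGTTCGTTA"  # hypothetical query sequence (QS)

# The QS inherits the taxonomic label of its best-matching RS.
label, score = best_hit_label(query, references)
print(label, score)
```

Here the query differs from the Taxon_B reference at a single position, so it receives that label regardless of where either sequence would sit in a phylogeny.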
Summary: We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command-line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven. Availability and implementation: Both genesis and gappa are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa. Supplementary information: Supplementary data are available at Bioinformatics online.