Critical Assessment of Metagenome Interpretation: the second round of challenges

Meyer, Fernando; Fritz, Adrian; Deng, Zhi-Luo; Koslicki, David; Lesker, Till Robin; Gurevich, Alexey; Robertson, Gary; Alser, Mohammed; Antipov, Dmitry; Beghini, Francesco; Bertrand, Denis; Brito, Jaqueline J.; Brown, C. Titus; Buchmann, Jan P.; Buluç, Aydın; Chen, Bin; Chikhi, Rayan; Clausen, Philip Thomas Lanken Conradsen; Cristian, Alexandru; Dabrowski, Piotr Wojtek; Darling, Aaron E.; Egan, Rob; Eskin, Eleazar; Georganas, Evangelos; Goltsman, Eugene; Gray, Melissa A.; Hansen, Lars Hestbjerg; Hofmeyr, Steven; Huang, Pei; Irber, Luiz; Jia, Huijue; Jørgensen, Tue Sparholt; Kieser, Silas; Klemetsen, Terje; Kola, Axel; Kolmogorov, Mikhail; Korobeynikov, Anton; Kwan, Jason C.; LaPierre, Nathan; Lemaitre, Claire; Li, Chenhao; Limasset, Antoine; Miranda, Fábio; Mangul, Serghei; Marcelino, Vanessa R.; Marchet, Camille; Marijon, Pierre; Meleshko, Dmitry; Mende, Daniel; Milanese, Alessio; Nagarajan, Niranjan; Nissen, Jakob Nybo; Nurk, Sergey; Oliker, Leonid; Paoli, Lucas; Peterlongo, Pierre; Piro, Vitor C.; Porter, Jacob; Rasmussen, Simon; Rees, Evan R.; Reinert, Knut; Renard, Bernhard Y.; Robertsen, Espen Mikal; Rosen, Gail; Ruscheweyh, Hans‐Joachim; Sarwal, Varuni; Segata, Nicola; Seiler, Enrico; Shi, Lizhen; Sun, Fengzhu; Sunagawa, Shinichi; Sørensen, Søren J.; Thomas, Ashleigh; Tong, Chengxuan; Trajkovski, Mirko; Tremblay, Julien; Uritskiy, Gherman; Vicedomini, Riccardo; Wang, Zhengyang; Wang, Ziye; Wang, Zhong; Warren, Andrew; Willassen, Nils Peder; Yelick, Katherine; You, Ronghui; Zeller, Georg; Zhao, Zhengqiao; Zhu, Shanfeng; Zhu, Jun; Garrido‐Oter, Ruben; Gastmeier, Petra; Hacquard, Stéphane; Häußler, Susanne; Khaledi, Ariane; Maechler, Friederike; Mesny, Fantin; Radutoiu, Simona; Schulze‐Lefert, Paul; Smit, Nathiana; Strowig, Till; Bremges, Andreas; Sczyrba, Alexander; McHardy, Alice C.

doi:10.1038/s41592-022-01431-4

Cited by 212 publications

(244 citation statements)

References 74 publications

Supporting

Mentioning

237

Contrasting


“…These previous studies have also used databases composed of only bacteria [1], only bacterial, archaeal, and viral genomes with complete assemblies [3], a MiniKraken database [1], or do not give full details on what is included in their database [2,4,5,7,9,17,18]. Additionally, while some other studies have used the NCBI non-redundant nucleotide database [6,8], we are not aware of any studies that have used the full NCBI RefSeq database that we have here, which is likely to have led to significantly worse performance in those previous comparisons (Fig. 4).…”

Section: Discussion (mentioning)

confidence: 99%

“…Due to the samples that were constructed by both the CAMI ( n =10) [5] and CAMI2 ( n =180) [6] studies using newly sequenced genomes, not all genomes – and therefore not all reads – have classifications at all taxonomic ranks. For some genomes, the lowest rank of the taxonomic classification given is at the family or genus level.…”

Section: Methods (mentioning)

confidence: 99%

“…There are many metagenomic classification programs that have been developed to do this, and while these use a range of different methods, one common step is the comparison of unknown reads within a sample to a database of known genomes or sequences. Several studies have compared the performance of different metagenomic classifiers using simulated or mock communities with a known taxonomic composition [1][2][3][4] or a consensus approach [5,6] and publications introducing new metagenomic classifiers also often compare the method being introduced with previous methods [7][8][9]. The "best" metagenomic classifier is frequently determined based on either the F1 score (harmonic mean of precision and recall) or L1 distance (also known as Manhattan or Taxicab distance), and typically varies depending on the environment that a sample comes from as well as the complexity (number of taxa and magnitude of abundance differences) of the sample [1,8].…”

Section: Introduction (mentioning)

confidence: 99%
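
The two metrics named in the citation statement above can be written out directly. What follows is a minimal Python sketch, not code from any of the cited studies. The function names `f1_score` and `l1_distance` are hypothetical, and the choice to score F1 on presence/absence of taxa and L1 on relative abundances is an assumption made for illustration.

```python
def f1_score(predicted_taxa, true_taxa):
    """F1 = harmonic mean of precision and recall over sets of taxa.

    Assumption: a taxon counts as a true positive if it appears in both
    the predicted and the known (gold standard) set.
    """
    predicted, truth = set(predicted_taxa), set(true_taxa)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def l1_distance(predicted_abundance, true_abundance):
    """L1 (Manhattan) distance between two relative-abundance profiles.

    Profiles are dicts mapping taxon -> relative abundance; a taxon
    missing from one profile is treated as having abundance 0.
    """
    taxa = set(predicted_abundance) | set(true_abundance)
    return sum(
        abs(predicted_abundance.get(t, 0.0) - true_abundance.get(t, 0.0))
        for t in taxa
    )
```

When both profiles sum to 1, the L1 distance ranges from 0 (identical profiles) to 2 (no shared taxa), which is why it complements a presence/absence metric such as F1.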


From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools

Wright, Comeau, Langille. 2022. Preprint.


In metagenomic analyses of microbiomes, one of the first steps is usually the taxonomic classification of reads by comparison to a database of previously taxonomically classified genomes. While different studies comparing metagenomic taxonomic classification methods have determined that different tools are "best", there are two tools that have been used the most to-date: Kraken (k-mer based classification against a user-constructed database) and MetaPhlAn (classification by alignment to clade-specific marker genes), the latest versions of which are Kraken2 and MetaPhlAn 3, respectively. We found large discrepancies in both the proportion of reads that were classified as well as the number of species that were identified when we used both Kraken2 and MetaPhlAn 3 to classify reads within metagenomes from human-associated or environmental datasets. We then investigated which of these tools would give classifications closest to the real composition of metagenomic samples using a range of simulated and mock samples and examined the combined impact of tool-parameter-database choice on the taxonomic classifications given. This revealed that there may not be a one-size-fits-all "best" choice. While Kraken2 can achieve better overall performance, with higher precision, recall and F1 scores, as well as alpha- and beta-diversity measures closer to the known composition than MetaPhlAn 3, the computational resources required for this may be prohibitive for many researchers, and the default database and parameters should not be used. We therefore conclude that the best tool-parameter-database choice for a particular application depends on the scientific question of interest, which performance metric is most important for this question and the limit of available computational resources.



“…In contrast, alignment-based methods show high tolerance for base variation. Marker gene approaches perform well in the identification of archaea and bacteria (Meyer et al ., 2022), while it is difficult to accurately identify viruses with this strategy since viruses do not have universally conserved genes, such as the 16S and 18S rRNA genes (Breitwieser et al ., 2019).…”

Section: Introduction (mentioning)

confidence: 99%

KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping

Shen, Xiang, Huang, et al. 2022. Preprint.


A growing number of microbial reference genomes enable better metagenomic profiling accuracy yet put higher requirements on the indexing efficiency, database size, and runtime of taxonomic profilers. Besides, most profilers focused mainly on bacterial, archaeal, and fungal populations with less attention on viral communities. We present KMCP, a novel k-mer based metagenomic profiling tool that introduces genomic positions to k-mers by splitting the reference genomes into chunks. Benchmarking results on both simulated and real data demonstrate that KMCP not only allows for accurate taxonomic profiling of archaea, bacteria, and viral populations from metagenomic shotgun sequence data, but also provides confident pathogen detection for infectious clinical samples of low depth. KMCP is implemented in Go and is available as open-source software, under MIT, at https://github.com/shenwei356/kmcp.
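
The idea of attaching positional information to k-mers by splitting a reference into chunks can be illustrated with a toy example. This is a rough sketch under stated assumptions and not KMCP's implementation, which is written in Go and uses its own index structure. The function names, the use of plain Python sets, the overlap of k-1 bases between neighbouring chunks, and the `min_fraction` threshold are all hypothetical choices made here for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_chunk_index(genome, n_chunks, k):
    """Split a genome into n_chunks pieces and store one k-mer set per piece.

    Assumption: neighbouring chunks overlap by k-1 bases so that k-mers
    spanning a chunk boundary are not lost.
    """
    size = -(-len(genome) // n_chunks)  # ceiling division
    index = []
    for c in range(n_chunks):
        start = c * size
        end = min(len(genome), start + size + k - 1)
        index.append(kmers(genome[start:end], k))
    return index


def chunk_hits(read, index, k, min_fraction=0.5):
    """Return the chunk numbers sharing at least min_fraction of the read's k-mers."""
    query = kmers(read, k)
    if not query:
        return []
    return [
        c for c, chunk in enumerate(index)
        if len(query & chunk) / len(query) >= min_fraction
    ]
```

Because each hit names a chunk rather than a whole genome, the matches of many reads indicate roughly where along the reference they fall, for example whether they are spread across all chunks or concentrated in one.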


“…These assemblers include meta-IDBA (Peng et al, 2011), metaSPAdes (Nurk et al, 2017), MEGAHIT (Li et al, 2016), and many others. Several recent studies have provided a comprehensive comparison of the computational performance and accuracy of these assemblers (Sczyrba et al, 2017; Vollmers et al, 2017; Meyer et al, 2021). While most of these assemblers can efficiently take advantage of the modern CPU’s multiple processing capabilities, they are limited on a single computer node and, therefore, are not able to assemble very large datasets due to the limited memory capacity.…”

Section: Introduction (mentioning)

confidence: 99%

Persistent Memory as an Effective Alternative to Random Access Memory in Metagenome Assembly

Sun, Egan, et al. 2022. Preprint.

Self Cite


The assembly of metagenomes decomposes members of complex microbe communities and allows the characterization of these genomes without laborious cultivation or single-cell metagenomics. Metagenome assembly is a process that is memory intensive and time consuming. Multi-terabyte sequences can become too large to be assembled on a single computer node, and there is no reliable method to predict the memory requirement due to data-specific memory consumption pattern. Currently, out-of-memory (OOM) is one of the most prevalent factors that accounts for metagenome assembly failures. In this study, we explored the possibility of using Persistent Memory (PMem) as a less expensive substitute for dynamic random access memory (DRAM) to reduce OOM and increase the scalability of metagenome assemblers. We evaluated the execution time and memory usage of three popular metagenome assemblers (MetaSPAdes, MEGAHIT, and MetaHipMer2) in datasets up to one terabase. We found that PMem can enable metagenome assemblers on terabyte-sized datasets by partially or fully substituting DRAM at a cost of longer running times. In addition, different assemblers displayed distinct memory/speed trade-offs in the same hardware/software environment. Because PMem was provided directly without any application-specific code modification, these findings are likely to be generalized to other memory-intensive bioinformatics applications.

