2022
DOI: 10.1101/2022.06.30.498336
Preprint

Pitfalls of genotyping microbial communities with rapidly growing genome collections

Abstract: Detecting genetic variants in metagenomic data is a priority for understanding the evolution, ecology, and functional characteristics of microbial communities. Many recent tools that perform this metagenotyping rely on aligning reads of unknown origin to a reference database of sequences from many species before calling variants. Using simulations designed to represent a wide range of scenarios, we demonstrate that diverse and closely related species both reduce the power and accuracy of reference-based metage…

citations
Cited by 3 publications
(5 citation statements)
references
References 86 publications
“…Across the 45 isolates, the median SNP false discovery rate (FDR) was very low for both methods (0.28% and 0.62% for Maast and Snippy; Figure 3c and S7a). Meanwhile, the sensitivity of Maast was consistently higher (median=93.2%) compared to Snippy (median=54.3%) (Figure 3d and Figure S7c), since Maast is less subject to false negatives due to reference bias [13] and coverage filtering (i.e., minimum 1X by default compared to minimum 10X for Snippy). Since simulated reads may capture less contamination and sequencing errors than real WGS data, we repeated the evaluation with 63 short read libraries of the same species (B. uniformis, Table S4) downloaded from the CGR study.…”
Section: Maast SNP Genotypes Are Highly Accurate
Citation type: mentioning
confidence: 96%
“…Sequence alignment is the major obstacle to analyzing so many genomes, though kSNP also remain largely untested with thousands of strains. A second challenge is the fact that many species have a high level of genome redundancy [13], especially when a biased sample of clonally related genomes has been sequenced, which is common for clinically important pathogens that are under intensive surveillance (e.g., PulseNet [14] and NCBI Pathogen Detection). This redundancy masks the diversity of unevenly sampled species, and it means that strains from poorly sampled lineages contribute little to the discovery of SNPs, especially when a relatively high MAF threshold is used.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…The first problem is fundamental: specifying and relying on a reference genome can be highly limiting for biological inference, and generating and or aligning to a reference is foundational in the field of genomics today. Alignment-based methods struggle when sequences can map to multiple or repetitive locations, and or reference genomes are incorrect or incomplete (Shi et al., 2022; Zhao, Shi and Pollard, 2022). These regions comprise ∼54% of the human genome (Nurk et al., 2022) and are sometimes among the most important to analyze.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Consider transposable elements, known to drive evolution and cause an unascertained number of human diseases (Pascarella et al., 2022). A custom and multi-step workflow is required to detect these insertions because they are highly polymorphic, and the algorithm must address the issues of multi-mapping and the high degree of repetitive sequence (Shi et al., 2022; Zhao, Shi and Pollard, 2022). Similarly, if V(D)J recombination is present in a sample, but a custom workflow is not specified to detect these rearrangements, they will not be reported.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…For a non-comprehensive reference database, species in the sample but missing from the reference database cannot be genotyped, causing false negative results. Conversely, for a non-representative reference genome, closely related species in the reference database but not in the sample may compete for reads, reducing read alignment uniqueness and even generating “phantom” metagenotypes when reads from another species are incorrectly aligned (Zhao, Zhou, & Pollard, 2022b). MIDAS2 combats these problems by selecting genomes from a comprehensive reference database to build a sample-specific reference database customized to species that are present in the samples (adjustable to scientific objectives), and by tuning alignment and filtering parameters.…”
Section: Introduction
Citation type: mentioning
confidence: 99%