Background: 16S rRNA gene amplicon sequencing is a very popular approach for studying microbiomes. However, varying standards exist for sample and data processing, and some basic concepts, such as the occurrence of spurious sequences, have not been investigated in a comprehensive manner; the present study addresses this gap.
Methods: Using defined communities of bacteria in vitro and in vivo, we searched for sequences not matching the expected species (i.e., spurious taxa) and determined a threshold of occurrence relevant for adequate data analysis. The origin of spurious taxa was then investigated via large-scale amplicon queries. We also assessed the impact of varying sequence filtering stringency on diversity readouts in human fecal and peat soil communities.
Results: 16S rRNA gene amplicon data processing based on operational taxonomic unit (OTU) clustering and singleton removal, a commonly used approach that discards any taxa represented by only one sequence across all samples, delivered on average approx. 50% (mock communities) to 80% (gnotobiotic mice) spurious taxa. This spurious fraction of taxa was lower based on amplicon sequence variant (ASV) analysis but varied depending on the gene region targeted and the barcoding system used. A relative abundance of 0.25% was identified as a cut-off: excluding taxa below it largely prevents spurious taxa from entering the analysis. Most spurious taxa (approx. 70%) detected in simplified communities occurred in samples multiplexed in the same sequencing run and were present in only one of ten runs. Use of the 0.25% relative abundance threshold decreased the coefficient of variation of richness, calculated for the same six human fecal samples across seven sequencing runs, by 38% compared with singleton filtering. The output of beta-diversity analyses of human fecal communities was markedly affected by both the filtering strategy and the type of phylogenetic distances used for comparing samples. Importantly, major findings were confirmed using data generated in a second sequencing facility.
Conclusions: Handling of artifact sequences during bioinformatic processing of 16S rRNA gene amplicon data requires careful attention to avoid the generation of misleading findings. A relative abundance threshold of 0.25% is more appropriate than singleton removal, although study-specific analysis strategies are mandatory. We propose the concept of effective richness, which will help compare results across studies.
Background: 16S rRNA gene amplicon sequencing is a very popular approach for studying microbiomes. However, varying standards exist for sample and data processing, and some basic concepts, such as the occurrence of spurious sequences, have not been investigated in a comprehensive manner.
Methods: Using defined communities of bacteria in vitro and in vivo, we searched for sequences not matching the expected species (i.e., spurious taxa) and determined a minimum threshold of occurrence suitable for robust data analysis. The presence and origin of spurious taxa were investigated via large-scale amplicon queries and gut samples from germfree mice spiked with target mock DNA. We also assessed the effect of varying sequence-filtering stringency on diversity readouts in human fecal and peat soil communities. Our findings are based on data generated in three sequencing facilities and analyzed via both operational taxonomic unit (OTU) and amplicon sequence variant (ASV) approaches.
Results: 16S rRNA gene amplicon data processing based on OTU clustering and singleton removal, a commonly used approach that discards any taxa represented by only one sequence across all samples, delivered on average approximately 50% (mock communities) to 80% (gnotobiotic mice) spurious taxa. The fraction of spurious taxa was generally lower based on ASV analysis, but varied depending on the gene region targeted and the barcoding system used. A relative abundance of 0.25% was found to be an effective cut-off: excluding taxa below it largely prevents spurious taxa from entering the analysis in both OTU- and ASV-based approaches. Most spurious taxa (approx. 70%) detected in simplified communities occurred in samples multiplexed in the same sequencing run and were present in only one of ten runs. DNase treatment of gut content from germfree mice partly helped to exclude spurious taxa from the analysis of spiked mock DNA, but was not necessary when applying the 0.25% relative abundance threshold. Using this cut-off improved the reproducibility of analysis, specifically by reducing variation in richness estimates by 38% compared with singleton filtering in a benchmarking experiment using six human fecal samples across seven sequencing runs. Beta-diversity analyses of human fecal communities were markedly affected by both the filtering strategy and the type of phylogenetic distances used for comparing samples, highlighting the importance of carefully analyzing data before drawing conclusions.
Conclusions: Handling of artifact sequences during bioinformatic processing of 16S rRNA gene amplicon data requires careful attention to avoid the generation of misleading findings. Applying a minimum relative abundance threshold between 0.10 and 0.30% is superior to the singleton removal approach, although study-specific analysis strategies may be needed depending on, for instance, the type of samples analyzed and the sequencing depth achieved. Additionally, we propose the concept of effective richness to facilitate the comparison of results across studies.
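The two filtering strategies compared above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The count table, the taxon names, and all function names are hypothetical. The sketch assumes that the relative abundance threshold is applied per sample (a taxon's count is set to zero in any sample where it falls below 0.25% of that sample's reads) and that effective richness means the number of taxa remaining in a sample after this filter; the abstract does not spell out either detail.

```python
from statistics import mean, stdev

# Hypothetical count table: taxon -> read counts per sample (two samples here).
TOY_TABLE = {
    "taxon_A": [600, 500],
    "taxon_B": [397, 497],
    "taxon_C": [2, 0],   # 0.2% of sample 1: below the 0.25% threshold
    "taxon_D": [0, 3],   # 0.3% of sample 2: above the threshold
    "taxon_E": [1, 0],   # a single read across all samples (singleton)
}


def n_samples(table):
    """Number of samples in the table (0 for an empty table)."""
    return len(next(iter(table.values()))) if table else 0


def remove_singletons(table):
    """Discard taxa represented by only one sequence across all samples."""
    return {t: c for t, c in table.items() if sum(c) > 1}


def filter_by_relative_abundance(table, threshold=0.0025):
    """Zero counts below `threshold` relative abundance within each sample.

    Assumption: the threshold is applied per sample; taxa left with no
    reads in any sample are dropped from the table.
    """
    n = n_samples(table)
    totals = [sum(c[i] for c in table.values()) for i in range(n)]
    filtered = {}
    for taxon, counts in table.items():
        kept = [
            c if totals[i] > 0 and c / totals[i] >= threshold else 0
            for i, c in enumerate(counts)
        ]
        if any(kept):
            filtered[taxon] = kept
    return filtered


def richness(table, samples):
    """Number of taxa with at least one read, per sample."""
    return [sum(1 for c in table.values() if c[i] > 0) for i in range(samples)]


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


if __name__ == "__main__":
    n = n_samples(TOY_TABLE)
    after_singletons = richness(remove_singletons(TOY_TABLE), n)
    after_threshold = richness(filter_by_relative_abundance(TOY_TABLE), n)
    print("Richness after singleton removal:", after_singletons)
    print("Richness after 0.25% threshold:  ", after_threshold)
```

In this toy table, singleton removal drops only taxon_E, whereas the threshold also removes the low-abundance taxon_C from sample 1. The `coefficient_of_variation` helper shows the kind of statistic that could be compared across repeated sequencing runs of the same sample; the numbers here are illustrative only and do not reproduce any result from the study.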