Background
Microorganisms are not only indispensable to ecosystem functioning but also keystones of emerging technologies. In the last 15 years, the number of studies on environmental microbial communities has increased exponentially owing to advances in sequencing technologies, but the large amount of data generated remains difficult to analyze and interpret. Recently, metabarcoding analysis has shifted from clustering reads into Operational Taxonomic Units (OTUs) to resolving Amplicon Sequence Variants (ASVs). Differences between these methods can seriously affect the biological interpretation of metabarcoding data, especially in ecosystems with high microbial diversity, because the methods have been benchmarked on low-diversity datasets.
Results
In this work, we thoroughly examined the differences in community diversity, structure, and complexity between the OTU and ASV methods, using culture-based mock and simulated datasets as well as soil- and plant-associated bacterial and fungal environmental communities. The analysis revealed four key findings. First, analyzing microbial datasets at the family level ensured both consistency and adequate coverage with either method. Second, the performance of both methods is related to community diversity and sample sequencing depth. Third, the choice of method affected sample diversity and the number of differentially abundant families detected upon treatment, which may lead researchers to draw different biological conclusions. Fourth, the observed differences can mostly be attributed to low-abundance families (relative abundance < 0.1%); extra care is therefore recommended when studying rare species using metabarcoding. The ASV method used here outperformed the adopted OTU method with respect to community diversity, especially for fungus-related sequences, but only when the sequencing depth was sufficient to capture the community complexity.
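As a rough illustration of the family-level aggregation and the 0.1% relative-abundance cutoff mentioned above, the following minimal Python sketch sums per-feature read counts by family and separates low-abundance families from the rest. It is not the pipeline used in this study: the function names, the feature identifiers, the family assignments, and the read counts are all hypothetical and chosen only to make the arithmetic visible.

```python
from collections import defaultdict


def family_relative_abundance(counts, taxonomy):
    """Sum read counts per family and convert them to relative abundances.

    counts   -- dict mapping a feature ID (OTU or ASV) to its read count
    taxonomy -- dict mapping a feature ID to its assigned family
    """
    totals = defaultdict(int)
    for feature, n_reads in counts.items():
        # Features without a family assignment are pooled as "Unassigned".
        totals[taxonomy.get(feature, "Unassigned")] += n_reads
    depth = sum(totals.values())
    if depth == 0:
        raise ValueError("sample contains no reads")
    return {family: n / depth for family, n in totals.items()}


def split_by_abundance(rel_abund, threshold=0.001):
    """Split families at a relative-abundance threshold (0.001 = 0.1%)."""
    common = {f: a for f, a in rel_abund.items() if a >= threshold}
    rare = {f: a for f, a in rel_abund.items() if a < threshold}
    return common, rare


# Hypothetical toy sample with a sequencing depth of 10,000 reads.
counts = {"asv1": 6000, "asv2": 3000, "asv3": 995, "asv4": 5}
taxonomy = {
    "asv1": "Bacillaceae",
    "asv2": "Bacillaceae",
    "asv3": "Pseudomonadaceae",
    "asv4": "Rhizobiaceae",
}

rel = family_relative_abundance(counts, taxonomy)
common, rare = split_by_abundance(rel)
print("common:", sorted(common))
print("rare:", sorted(rare))
```

In this toy sample, the family represented by only 5 of 10,000 reads (0.05%) falls below the cutoff. Families in this range are the ones where, according to the fourth finding, the OTU and ASV methods are most likely to disagree.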
Conclusions
Metabarcoding data should be analyzed with care. Correct biological interpretation depends on several factors, including sufficiently deep sequencing of the samples, choice of the filtering strategy most appropriate for the specific research goal, and aggregation of the data at the family level.