A growing number of studies recognize the importance of characterizing the species diversity and composition of bacteria hosted by biota, in systems that range from oceans to humans. This task is typically addressed using environmental sequencing data ("metagenomics"). However, determining microbiome diversity requires classifying the species composition of the sampled community, which is often done by assigning individual reads to taxa through comparison with a reference database. Although computational methods for identifying microbial taxa are available, it is well known that community inferences drawn from the same sample with different methods can vary widely, depending on the biases introduced at each step of the analysis. In this study, we compare different bioinformatics methods for taxonomic classification based on amplicon sequencing of 16S ribosomal RNA and on whole-genome shotgun sequencing. We apply the methods to three mock bacterial communities whose composition is known. We show that 16S data allow reliable detection of the number of species, but not of their abundances, while standard methods based on shotgun data give a reliable estimate of the most abundant species but predict a large number of false-positive species. We therefore propose a novel approach that combines shotgun data with a classification based on core protein families (PFAM), and is hence similar in spirit to 16S. We show that this method reliably predicts both the number of species and the abundances in the bacterial mock communities.
January 3, 2020
Author summary
Characterizing the species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. However, determining microbiome diversity requires classifying the species composition of the sampled community. Although many computational methods for identifying microbial taxa are available, it is well known that community inferences drawn from the same sample with different methods can vary widely, depending on the biases introduced at each step of the analysis. In most studies that benchmark protocols for taxonomic classification from biological samples, the "ground truth" of the species present and their relative abundances is not known. Therefore, mock communities or simulated datasets remain the basis for a robust comparative evaluation of a method's prediction accuracy. In this work, we first compare different bioinformatics methods for taxonomic classification. We apply the methods to three mock bacterial communities whose composition is known. We show that no method is able to correctly predict both the number of species and their abundances. We then propose a novel approach based on core protein families that reliably infers both the number of species and the abundances in the bacterial mock communities.
Modern high-throughput genome sequencing techniques revolutionized ecological studies of microbial...