Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-independent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from > 1,500 public metagenomes. All genomes are estimated to be ≥ 50% complete and nearly half are ≥ 90% complete with ≤ 5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by > 30% and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter.
Articles | Nature Microbiology

… form the UBA data set as they met our filtering criteria of having an estimated quality ≥ 50 (defined as the estimated completeness of a genome minus five times its estimated contamination) and consisting of ≤ 500 scaffolds with an N50 ≥ 10 kb (Fig. 1 and Supplementary Table 2). Over 93% of the 7,903 UBA genomes have an average coverage of ≥ 10× (5th percentile, 9.2×; 95th percentile, 268×) and 95.8% have > 5× coverage over 90% of bases, providing assurance of high-quality base-calling across the genomes (refs. 3,36). Among the UBA genomes is a subset of 3,438 near-complete genomes (3,225 bacterial and 213 archaeal) estimated to be ≥ 90% complete with ≤ 5% contamination (Fig. 1a). These genomes consist of ≤ 100 scaffolds in 70.2% of cases (≤ 200 scaffolds in 92.0% of genomes) and have an average N50 of 136 kb. Comparison of near-complete UBA genomes that are conspecific strains of complete isolate genomes also suggests that the recovered MAGs have no systematic loss of genomic content, with the exception of extrachromosomal elements such as plasmids (Supplementary Note 1).

The UBA data set was also assessed relative to the criteria used by the Human Microbiome Project (HMP) for defining high-quality draft genomes (refs. 3,37). Of the 3,438 UBA genomes we have defined as near complete, 3,201 (93.1%) pass all of the HMP criteria, with the only substantial exception being the 4.8% of genomes whose scaffolds have an N50 of < 20 kb (Supplementary Table 3). Nearly half of the remaining 4,465 UBA genomes also pass the HMP criteria for being high quality, except that they are estimated to be < 90% complete.

The presence of tRNAs for the standard 20 amino acids was examined as a secondary measure of genome quality (Fig. 1c). The 3,438 near-complete UBA genomes have tRNAs for an average of 17.3 ± 2.2 of the 20 amino acids, and for ≥ 15 amino acids in 90.3% of the genomes. The correlation between estimated genome completeness and identified tRN...
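The filtering criteria stated above (quality = completeness minus five times contamination, ≤ 500 scaffolds, N50 ≥ 10 kb) and the near-complete threshold (≥ 90% complete, ≤ 5% contamination) can be illustrated with a minimal sketch. The names `GenomeBin`, `n50`, `passes_uba_filter` and `is_near_complete` are hypothetical and chosen for illustration only; this is not the pipeline used in the study, and it assumes that completeness and contamination estimates (in percent) and scaffold lengths (in bp) are already available for each genome.

```python
from dataclasses import dataclass
from typing import List


def n50(scaffold_lengths: List[int]) -> int:
    """Return the N50: the largest length L such that scaffolds of
    length >= L together cover at least half of the total assembly."""
    total = sum(scaffold_lengths)
    running = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


@dataclass
class GenomeBin:
    """Hypothetical container for one recovered genome (illustrative only)."""
    completeness: float          # estimated completeness, in percent
    contamination: float         # estimated contamination, in percent
    scaffold_lengths: List[int]  # scaffold lengths, in bp

    @property
    def quality(self) -> float:
        # Quality as defined in the text: completeness - 5 x contamination.
        return self.completeness - 5.0 * self.contamination


def passes_uba_filter(g: GenomeBin) -> bool:
    """Quality >= 50, <= 500 scaffolds, N50 >= 10 kb."""
    return (
        g.quality >= 50.0
        and len(g.scaffold_lengths) <= 500
        and n50(g.scaffold_lengths) >= 10_000
    )


def is_near_complete(g: GenomeBin) -> bool:
    """Near-complete subset: passes the filter, >= 90% complete, <= 5% contaminated."""
    return passes_uba_filter(g) and g.completeness >= 90.0 and g.contamination <= 5.0


if __name__ == "__main__":
    example = GenomeBin(completeness=92.0, contamination=2.0,
                        scaffold_lengths=[150_000] * 20)
    print(example.quality, passes_uba_filter(example), is_near_complete(example))
```

Because contamination is weighted five times, a genome that is 70% complete with 5% contamination scores 45 and is excluded, whereas any genome meeting the near-complete threshold scores at least 65.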