Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. Although this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of "marker" genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate-, single-cell-, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.
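To make the idea of marker-based quality estimation concrete, the sketch below shows the simple counting approach that the abstract describes as the existing practice: completeness as the fraction of expected single-copy marker genes that are found, and contamination as the fraction found in extra copies. This is an illustrative assumption, not CheckM's algorithm, which additionally uses lineage-specific marker sets and the collocation of marker genes. The function name `estimate_quality` and the marker identifiers are hypothetical.

```python
from collections import Counter


def estimate_quality(marker_hits, marker_set):
    """Naive marker-gene estimate of completeness and contamination.

    marker_hits: iterable of marker identifiers detected in the genome,
                 one entry per detected copy (hypothetical input format).
    marker_set:  set of single-copy markers expected in a complete genome.

    Returns (completeness, contamination) as percentages.
    """
    # Count copies of each expected marker; ignore hits outside the set.
    counts = Counter(hit for hit in marker_hits if hit in marker_set)

    n_expected = len(marker_set)
    n_present = len(counts)                            # markers seen at least once
    n_extra = sum(c - 1 for c in counts.values())      # copies beyond the first

    completeness = 100.0 * n_present / n_expected
    contamination = 100.0 * n_extra / n_expected
    return completeness, contamination


# Example: four expected markers, one missing, one duplicated.
markers = {"m1", "m2", "m3", "m4"}
hits = ["m1", "m2", "m2", "m3", "unrelated"]
completeness, contamination = estimate_quality(hits, markers)
```

Treating every marker independently, as this sketch does, is the kind of simplification that lineage-specific, collocated marker sets are meant to improve on.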
STAMP is licensed under the GNU GPL. Python source code and binaries are available from our website at: http://kiwi.cs.dal.ca/Software/STAMP.
Microbial communities are vital in the functioning of all ecosystems; however, most microorganisms are uncultivated, and their roles in natural systems are unclear. Here, using random shotgun sequencing of DNA from a natural acidophilic biofilm, we report reconstruction of near-complete genomes of Leptospirillum group II and Ferroplasma type II, and partial recovery of three other genomes. This was possible because the biofilm was dominated by a small number of species populations and the frequency of genomic rearrangements and gene insertions or deletions was relatively low. Because each sequence read came from a different individual, we could determine that single-nucleotide polymorphisms are the predominant form of heterogeneity at the strain level. The Leptospirillum group II genome had remarkably few nucleotide polymorphisms, despite the existence of low-abundance variants. The Ferroplasma type II genome seems to be a composite from three ancestral strains that have undergone homologous recombination to form a large population of mosaic genomes. Analysis of the gene complement for each organism revealed the pathways for carbon and nitrogen fixation and energy generation, and provided insights into survival strategies in an extreme environment.
Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-independent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from >1,500 public metagenomes. All genomes are estimated to be ≥50% complete and nearly half are ≥90% complete with ≤5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30% and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter.

[...] form the UBA data set as they met our filtering criteria of having an estimated quality ≥50 (defined as the estimated completeness of a genome minus five times its estimated contamination) and consisting of ≤500 scaffolds with an N50 ≥10 kb (Fig. 1 and Supplementary Table 2). Over 93% of the 7,903 UBA genomes have an average coverage of ≥10× (5th percentile, 9.2×; 95th percentile, 268×) and 95.8% have >5× coverage over 90% of bases, providing assurance of high-quality base-calling across the genomes (refs 3,36). Among the UBA genomes is a subset of 3,438 near-complete genomes (3,225 bacterial and 213 archaeal) estimated to be ≥90% complete with ≤5% contamination (Fig. 1a). These genomes consist of ≤100 scaffolds in 70.2% of cases (≤200 scaffolds in 92.0% of genomes) and have an average N50 of 136 kb.

Comparison of near-complete UBA genomes that are conspecific strains of complete isolate genomes also suggests that the recovered MAGs have no systematic loss of genomic content, with the exception of extrachromosomal elements such as plasmids (Supplementary Note 1).

The UBA data set was also assessed relative to the criteria used by the Human Microbiome Project (HMP) for defining high-quality draft genomes (refs 3,37). Of the 3,438 UBA genomes we have defined as near complete, 3,201 (93.1%) pass all of the HMP criteria, with the only substantial exception being 4.8% of the genomes having scaffolds with an N50 of <20 kb (Supplementary Table 3). Nearly half of the remaining 4,465 UBA genomes also pass the HMP criteria for being high quality except that they are estimated to be <90% complete.

The presence of tRNAs for the standard 20 amino acids was examined as a secondary measure of genome quality (Fig. 1c). The 3,438 near-complete UBA genomes have tRNAs that encode an average of 17.3 ± 2.2 of the 20 amino acids and ≥15 amino acids in 90.3% of the genomes. The correlation between estimated genome completeness and identified tRN...
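The filtering criteria stated above reduce to a few arithmetic checks, sketched here in Python. The thresholds (quality ≥ 50, ≤ 500 scaffolds, N50 ≥ 10 kb, and ≥ 90% completeness with ≤ 5% contamination for near-complete genomes) are taken from the text; the `GenomeStats` class, its field names, and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    """Per-genome summary values (hypothetical container)."""
    completeness: float   # estimated completeness, percent
    contamination: float  # estimated contamination, percent
    scaffolds: int        # number of scaffolds
    n50: int              # scaffold N50 in base pairs


def quality_score(g: GenomeStats) -> float:
    # Quality = completeness minus five times contamination.
    return g.completeness - 5 * g.contamination


def passes_filter(g: GenomeStats) -> bool:
    # Quality >= 50, at most 500 scaffolds, N50 of at least 10 kb.
    return quality_score(g) >= 50 and g.scaffolds <= 500 and g.n50 >= 10_000


def is_near_complete(g: GenomeStats) -> bool:
    # Near complete: >= 90% complete with <= 5% contamination.
    return g.completeness >= 90 and g.contamination <= 5


# Example genomes with made-up values.
good = GenomeStats(completeness=92.0, contamination=3.0, scaffolds=80, n50=136_000)
poor = GenomeStats(completeness=60.0, contamination=4.0, scaffolds=300, n50=15_000)
```

With these made-up values, `good` scores 92 − 5 × 3 = 77 and passes both checks, while `poor` scores 60 − 5 × 4 = 40 and is rejected despite acceptable scaffold statistics. The factor of five means contamination is penalised much more heavily than missing content.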