Comprehensive reference data is essential for accurate taxonomic and functional characterization of the human gut microbiome. Here we present the Unified Human Gastrointestinal Genome (UHGG) collection, a resource combining 286,997 genomes representing 4,644 prokaryotic species from the human gut. These genomes contain over 625 million protein sequences used to generate the Unified Human Gastrointestinal Protein (UHGP) catalogue, a collection that more than doubles the number of gut protein clusters over the Integrated Gene Catalogue. We find that a large portion of the human gut microbiome remains to be fully explored, with over 70% of the UHGG species lacking cultured representatives, and 40% of the UHGP missing meaningful functional annotations. Intra-species genomic variation analyses revealed a large reservoir of accessory genes and single-nucleotide variants, many of which were specific to individual human populations. These freely available genomic resources should greatly facilitate investigations into the human gut microbiome.

Main

The human gut microbiome has been implicated in important phenotypes related to human health and disease 1,2. However, incomplete reference data that are missing microbial diversity 3 hamper our understanding of the roles of individual microbiome species, their interactions and functions. Hence, establishing a comprehensive collection of microbial reference genomes and genes is an important step for accurate characterization of the taxonomic and functional repertoire of the intestinal microbial ecosystem.

The Human Microbiome Project (HMP) 4 was a pioneering initiative to enrich our knowledge [...] (Supplementary Fig. 2), a standardized taxonomic framework based on a concatenated protein phylogeny representing >140,000 public prokaryote genomes, fully resolved to the species level (see 'Methods' for details on the taxonomy nomenclature used).
However, over 60% of the gut genomes could not be assigned to an existing species, confirming the majority of the UHGG species lack representation in current reference databases.
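The proportion of genomes without an existing species assignment can be thought of as a count of lineage strings whose species rank is empty. The following is a minimal sketch in Python, assuming semicolon-delimited lineage strings with rank prefixes such as `g__` and `s__`. The function names and the abbreviated toy lineages are hypothetical illustrations, not the study's pipeline or data.

```python
from typing import Iterable


def species_label(lineage: str) -> str:
    """Return the species name from a lineage string, or '' if the species rank is empty or absent."""
    for rank in lineage.split(";"):
        rank = rank.strip()
        if rank.startswith("s__"):
            return rank[3:].strip()
    return ""


def fraction_unassigned(lineages: Iterable[str]) -> float:
    """Fraction of lineages that carry no species-level name."""
    items = list(lineages)
    if not items:
        raise ValueError("no lineages supplied")
    unassigned = sum(1 for lineage in items if not species_label(lineage))
    return unassigned / len(items)


# Abbreviated toy lineages (assumed format, illustrative only).
toy_lineages = [
    "d__Bacteria;g__Faecalibacterium;s__Faecalibacterium prausnitzii",
    "d__Bacteria;g__Faecalibacterium;s__",
    "d__Bacteria;g__Blautia;s__Blautia obeum",
    "d__Bacteria;g__Blautia;s__",
]

print(f"{fraction_unassigned(toy_lineages):.0%} of toy genomes lack a species name")
```

Applied to a real classification table, the same count would give the share of genomes that fall outside the named species of the reference taxonomy.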
Comparison of species recovered in individual studies

We investigated how many of the 4,644 gut species were found in the different study collections in order to determine their level of overlap and reproducibility, as well as the ratio between cultured and uncultured species (Fig. 2a). The largest intersection was between the collections of MAGs, with the same 1,081 species detected independently in the CIBIO, EBI and HGM datasets, but not in any of the cultured genome studies. By restricting the analysis to genomes recovered from 1,554 samples common to all three MAG studies, we found that 93-97% of species from each set were detected in at least one other MAG collection, and 79-86% across all three (Supplementary Fig. 3a). A similar level of species overlap was observed when comparing studies on a per-sample basis (Supplementary Fig. 3b). F...