Background
Analysis of metagenomic and metatranscriptomic data is complicated and typically requires extensive computational resources. Leveraging a curated reference database of genes encoded by members of the target microbiome can make these analyses more tractable. Unfortunately, there is no such reference database available for the vaginal microbiome.

Results
In this study, we assembled a comprehensive human vaginal non-redundant gene catalog (VIRGO) from 264 vaginal metagenomes and 416 genomes of urogenital bacterial isolates. VIRGO includes 0.95 million non-redundant genes compiled from a

Keywords
vaginal microbiome, metagenome and metatranscriptome reference database, non-redundant gene catalog, intraspecies diversity, gene-centric design, protein family catalog, multi-omics data integration

Background
The microbial communities that inhabit the human body play critical roles in the maintenance of health, and dysfunction of these communities is often associated with disease [1]. Taxonomic profiling of the human microbiome via 16S rRNA gene amplicon sequencing has provided critical insight into the potential role of the microbiota in a wide array of common diseases [2, 3, 4]. Yet these data routinely fall short of describing the etiology of microbiome-associated diseases such as bacterial vaginosis [5, 6], Crohn's disease [7, 8] or psoriasis [9], among others. This is perhaps because, while 16S rRNA gene sequencing can provide species-level taxonomic profiles of a microbial community, it does not describe the genes or metabolic functions that are encoded in the constituents' genomes. This is an important distinction because strains of a bacterial species have been documented to exhibit substantial diversity in gene content [10], such that their genomes harbor sets of accessory genes whose presence is variable [11, 12].
It is therefore difficult, if not impossible, to infer the complete function of a microbial species in a specific environment using only the sequence of its 16S rRNA gene. As a consequence, to investigate the role of the human microbiome in health and disease, particular emphasis should be placed on describing the gene content and gene expression of these microbial communities.

Metagenomic and metatranscriptomic profiling are emerging approaches aimed at characterizing the gene content and expression of microbial communities. Results have led to increased appreciation for the important role microbial communities play in human health and disease [13, 14]. Despite the rapid development and increased throughput of sequencing technologies, current knowledge of the genetic and functional diversity of microbial communities is still highly limited. This is due, at least in part, to a lack of resources necessary for the analysis of these massive short-read datasets [13, 15]. De novo assembly of metagenomic or metatranscriptomic datasets typically requires rather subst...