Mycobacterium bovis bacillus Calmette-Guérin (M. bovis BCG) is the only vaccine available against tuberculosis (TB). This study reports on an integrated genome analysis workflow for BCG, resulting in the completely assembled genome sequence of BCG Danish 1331 (07/270), one of the WHO reference strains for BCG vaccines. We demonstrate how this analysis workflow enables the resolution of genome duplications and of the genome of engineered derivatives of this vaccine strain.
MAIN TEXT
The BCG live attenuated TB vaccine is one of the oldest and most widely used vaccines in human medicine. Each year, BCG vaccines are administered to over 100 million newborns (i.e. 75% of all newborns on the planet). The original BCG strain was developed in 1921 at the Pasteur Institute, through attenuation of the bovine TB pathogen M. bovis, by 231 serial passages on potato slices soaked in glycerol-ox bile over a time span of 13 years 1 . This BCG Pasteur strain was subsequently distributed to laboratories around the world, and different laboratories maintained their own daughter strains by passaging. Over the years, different substrains arose with different protective efficacy 2,3 . The establishment of a frozen seed-lot system in 1956 and the WHO recommendation of 1966 that vaccines should not be prepared from cultures that had undergone >12 passages starting from a defined freeze-dried seed lot halted the accumulation of additional genetic changes 1 . In an effort to further standardize vaccine production and to prevent severe adverse reactions related to BCG vaccination, three substrains, i.e. Danish 1331, Tokyo 172-1 and Russian BCG-1, were established as the WHO reference strains in 2009 and 2010 4 . Of these, the BCG Danish 1331 strain is the most frequently used, and it also serves as the basis of most current "next-generation" engineering efforts to improve the BCG vaccine or to use it as a "carrier" for antigens of other pathogens 5,6 . Complete genome elucidation of BCG strains is complicated by the occurrence of large genome segment duplications and a high GC content. Therefore, no fully assembled reference genome is yet available for BCG Danish, only incomplete ones 7,8 , which hinders further standardization efforts.
By combining second-generation (Illumina) and third-generation (PacBio) sequencing technologies with an integrated bioinformatics workflow, we have for the first time fully assembled the genome sequence of the BCG Danish 1331 (07/270) strain. Ambiguous regions were locally reassembled and/or experimentally verified. The single circular chromosome is 4,411,814 bp in length and encodes 4,084 genes, including 4,004 protein-coding genes, 5 genes for rRNA, 45 genes for tRNA and 30 pseudogenes (Fig. 1a). Compared to the reference genome sequence of BCG Pasteur 1173P2, 42 SNPs were identified and a selected subset was validated (Table 1 and 5). Genetic features determinative for BCG Danish, as described by Abdallah et al. 8 , were ...