The gene complement of wild-type human cytomegalovirus (HCMV) is incompletely understood, on account of the size and complexity of the viral genome and because laboratory strains have undergone deletions and rearrangements during adaptation to growth in culture. We have determined the sequence (241 087 bp) of chimpanzee cytomegalovirus (CCMV) and have compared it with published HCMV sequences from the laboratory strains AD169 and Toledo, with the aim of clarifying the gene content of wild-type HCMV. The HCMV and CCMV genomes are moderately diverged and essentially collinear. On the basis of conservation of potential protein-coding regions and other sequence features, we have discounted 51 previously proposed HCMV ORFs, modified the interpretations for 24 (including assignments of multiple exons) and proposed ten novel genes. Several errors were detected in the published HCMV sequences. We presently recognize 165 genes in CCMV and 145 in AD169; this compares with an estimate of 189 unique genes for AD169 made in 1990. Our best estimate for the complement of wild-type HCMV is 164 to 167 genes.
INTRODUCTION

Human cytomegalovirus (HCMV; human herpesvirus 5) is ubiquitous and largely inapparent, but poses a risk of serious disease to those lacking a competent immune system, such as neonates, transplant patients and sufferers from AIDS (reviewed in Pass, 2001). HCMV is the prototype of subfamily Betaherpesvirinae, and is the most complex of the eight human herpesvirus species. HCMV is isolated routinely on human fibroblast cell lines, and several strains in common laboratory use, such as AD169 and Towne, were derived by multiple passages on such cells (reviewed in Mocarski & Tan Courcelle, 2001).

The linear, double-stranded DNA genome of AD169 comprises two covalently linked segments (L and S), each consisting of a unique region (UL and US) flanked by an inverted repeat (TRL and IRL, TRS and IRS), yielding the overall genome configuration TRL-UL-IRL-IRS-US-TRS (reviewed in Mocarski & Tan Courcelle, 2001). In addition, the genome is terminally redundant, possessing a short region (the a sequence) as a direct repeat at the termini and also in inverse orientation at the IRL-IRS junction. Some genomes contain tandemly reiterated copies of the a sequence at these locations. UL and US can invert relative to each other by recombination between inverted repeats in replicating DNA, resulting in four equimolar genome arrangements in virion DNA. The complete DNA sequence of AD169 was published in a seminal paper by Chee et al. (1990), and at that time was the largest viral genome sequence available. The total genome size was 229 354 bp, with UL being 166 972 bp, US 35 418 bp, RL (a collective term for TRL and IRL) 11 247 bp, RS (TRS and IRS) 2524 bp and the a sequence (part of RL and RS in the sizes given above) 578 bp.

As a primary criterion for identifying protein-coding regions, Chee et al. (1990) focused on open reading frames (ORFs) of 100 or more contiguous amino acid-encoding codons that ov...