We developed a novel, efficient scheme, DEFOG (for "deciphering families of genes"), for determining the sequences of numerous genes from a family of interest. The scheme provides a powerful means to obtain the composition of a gene family in species for which high-throughput genomic sequencing data are not available. DEFOG uses two key procedures. The first is a novel algorithm for designing highly degenerate primers based on a set of known genes from the family of interest. These primers are used in PCR to amplify the members of the gene family. The second combines oligofingerprinting of the cloned PCR products with clustering of the clones based on their fingerprints. By selecting members from each cluster, a low-redundancy clone subset is chosen for sequencing. We applied the scheme to the human olfactory receptor (OR) genes. OR genes constitute the largest gene superfamily in the human genome, as well as in the genomes of other vertebrate species. DEFOG almost tripled the size of the initial repertoire of human ORs in a single experiment, and only 7% of the PCR clones had to be sequenced. Extremely high degeneracies, reaching over a billion combinations of distinct PCR primer pairs, proved to be very effective and yielded only 0.4% nonspecific products.

characterized in rat a decade ago [2] and have since been detected in many vertebrate species [reviewed in 6]. So far, about 3000 OR genes and pseudogenes are known in 24 vertebrate species [7][8][9][10][11][12]. They are divided into 32 distinct families based on phylogenetic analysis [8].

Roughly 900 OR coding sequences were found in the first draft of the human genome [9,10]. As predicted [7,13], between 53% and 63% of them have frame disruptions and are therefore considered pseudogenes. OR genes are found on all human chromosomes except chromosomes 20 and Y [7,9,10,13,14]. Almost 80% of the ORs are organized in clusters of six or more genes [9,10].
This is in good agreement with previous fluorescence in situ hybridization (FISH) experiments and sequencing data [7,13,14].
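
To make the degeneracy figures quoted in the abstract more concrete, the following minimal Python sketch counts how many distinct sequences a degenerate primer encodes, and how many primer pair combinations result, using the standard IUPAC nucleotide ambiguity codes. It is an illustration only and not the primer design algorithm of DEFOG. The function names and the two primer sequences are hypothetical and were invented for this example.

```python
from math import prod

# Number of concrete bases represented by each IUPAC nucleotide code.
IUPAC_CHOICES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def degeneracy(primer: str) -> int:
    """Return the number of distinct sequences encoded by a degenerate primer."""
    return prod(IUPAC_CHOICES[base] for base in primer.upper())


def pair_degeneracy(forward: str, reverse: str) -> int:
    """Return the number of distinct (forward, reverse) primer combinations."""
    return degeneracy(forward) * degeneracy(reverse)


# Hypothetical primers, chosen only to exercise the functions above.
forward = "ATGGCNTAYGAYMGNTA"
reverse = "TCNARRCTRTADATNAR"

if __name__ == "__main__":
    print(degeneracy(forward))                # 128
    print(degeneracy(reverse))                # 768
    print(pair_degeneracy(forward, reverse))  # 98304
```

Because the degeneracy is a product over positions, each additional ambiguous position multiplies the count, which is how a primer pair can reach the very large numbers of combinations reported here.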
Article doi:10.1006/geno.2002.6830, available online at http://www.idealibrary.com on IDEAL

The coding region of OR genes spans approximately 1 kb. This region lacks introns [2,13] and is preceded by a large intron and several short noncoding exons [16][17][18][19][20]. Inside the coding region there are several conserved segments [21] that allow easy amplification of the intronless coding region from genomic DNA by PCR assay. The PCR product is termed the "olfactory receptor sequence tag" (OST) [7].

We developed and experimentally tested a practical scheme for deciphering families of genes (DEFOG). The scheme provides a powerful means to obtain a sequence composition of a gene family in species for which high-throughput genomic sequencing data are unavailable. To validate DEFOG, we tested it on the human OR gene superfamily. Starting with a limited number of human ORs, it almost tripled the size of this set in a single experiment. We suggest that DEFOG can be successfully appli...