We have developed a simple procedure to identify protein homologs in genomic databases. The program, called ORF, is based on comparisons of predicted secondary structure. Protein structure is far better conserved than amino acid sequence, and structure-based methods have been effective in exploiting this fact to find homologs, even among proteins with scant sequence identity. ORF is a secondary structure-based method that operates solely on predictions from sequence and requires no experimentally determined information about the structure. The approach is illustrated by an example: thymidylate synthase, a highly conserved enzyme essential to thymidine biosynthesis in both prokaryotes and eukaryotes, is thought to be used by Archaea, but a corresponding gene has yet to be identified. Here, a candidate thymidylate synthase is identified as a previously unassigned open reading frame from the genome of Methanococcus jannaschii, viz., MJ0757. Using primary structure information alone, the optimally aligned sequence identity between MJ0757 and Escherichia coli thymidylate synthase is 7%, well below the threshold of sensitivity for detection by sequence-based methods.

At least 12 genomes have now been sequenced from diverse organisms, with many additions anticipated in coming weeks. How can this wealth of information best be used to address fundamental questions in biology? In particular, how can related protein domains be identified among organisms that diverged during the Cambrian explosion or earlier (1)? The mechanism of protein evolution gives rise to homologous sequences, with attendant redundancy. Computational biologists have exploited this fact in developing powerful recognition tools. Among these, sequence-based methods (2) to recognize homologs are well developed, but sensitivity falters as sequence similarity sinks into the "twilight zone," a threshold near 30% sequence identity (3).
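To make the percent-identity figures quoted above concrete, the following Python sketch computes sequence identity for a pair of sequences that have already been aligned, and checks it against the approximate 30% threshold. It is an illustration only, not part of ORF: the function names, the gap character, and the choice to count identity over columns in which neither sequence has a gap are all assumptions, and other denominator conventions are in common use.

```python
# Approximate "twilight zone" threshold quoted in the text (3), in percent.
TWILIGHT_ZONE = 30.0


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap.

    Both inputs are rows of an existing pairwise alignment, so they must have
    equal length. The denominator convention is an assumption (hypothetical
    helper, not taken from the paper).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap in either row
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    if compared == 0:
        return 0.0  # nothing to compare; report zero rather than divide by zero
    return 100.0 * matches / compared


def below_twilight_zone(identity: float, threshold: float = TWILIGHT_ZONE) -> bool:
    """True when the identity falls under the given threshold (default 30%)."""
    return identity < threshold
```

Under these assumptions, an identity of 7%, as reported for MJ0757 and E. coli thymidylate synthase, gives `below_twilight_zone(7.0) == True`, which is the regime in which purely sequence-based detection is expected to falter.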
Sensitivity can be extended by using information from multiple aligned sequence families (4, 5), local multiple alignment of blocks (6-9), and structure-based fold recognition such as threading (ref. 10 and references therein) and profiles (11).

Here we present a procedure for homolog recognition based on secondary structure prediction. The method is implemented in a computer program called ORF, an acronym for Ostensible Recognition of Folds. Unlike many other fold recognition approaches, ORF requires no three-dimensional template. In brief, ORF operates solely on sequence information to predict the secondary structure of both an unknown protein and all entries in a database of interest and then uses this information in a query-against-all alignment to select likely candidates. The strategy is based on a simple idea: although sequence space is vast, the number of conceivable protein folds is small, of order 5,000 or fewer (12-15). Typically, such folds can be parsed into a linear sequence of repetitive secondary structure elements interconnected by intervening nonrepetitive regions (i.e., helices, β-strands, and everythi...