A database of functional sites for proteins with known structures, SITE, is constructed and used in conjunction with a simple pattern matching program, SiteMatch, to evaluate possible function conservation in a recently constructed database of fold predictions for Escherichia coli proteins (Rychlewski L et al., 1999, Protein Sci 8:614-624). In this and other prediction databases, fold predictions are based on algorithms that can recognize weak sequence similarities and putatively assign new proteins to already characterized protein families. It is not clear whether such sequence similarities arise from distant homologies or from a general similarity of physicochemical features along the sequence. Leaving aside the important question of the nature of relations within fold superfamilies, it is possible to assess function conservation by looking at the pattern of conservation of crucial functional residues. SITE consists of a multilevel function description based on structure annotations and structure analyses. In particular, active site residues, ligand binding residues, and patterns of hydrophobic residues on the protein surface are used to describe different functional features. SiteMatch is designed to check the conservation of residues involved in protein activity in alignments generated by any alignment method. Here, this procedure is used to study conservation of functional features in alignments between protein sequences from the E. coli genome and their optimal structural templates. The optimal templates were identified, and the alignments taken, from the database of genomic structural predictions described in a previous publication (Rychlewski L et al., 1999, Protein Sci 8:614-624). An automated assessment of function conservation is used to analyze the relation between fold and function similarity for a large number of fold predictions.
For instance, it is shown that identifying low-significance predictions with a high level of functional residue conservation can be used to extend the sensitivity of fold prediction methods. Over 100 new fold/function predictions in this class were obtained in the E. coli genome. At the same time, about 30% of our previous fold predictions are not confirmed as function predictions, further highlighting the problem of function divergence in fold superfamilies.
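The kind of check performed on an alignment can be illustrated with a minimal sketch. It is not the SiteMatch implementation: the function name `site_conservation`, the 1-based numbering of template residues, and the use of strict residue identity as the conservation criterion are all assumptions made for illustration.

```python
def site_conservation(template_aln, query_aln, site_positions):
    """Count functional-site residues of the template that are conserved in the query.

    template_aln, query_aln: rows of a pairwise alignment, with '-' for gaps.
    site_positions: 1-based residue numbers in the ungapped template
                    (hypothetical convention, not taken from the paper).
    Returns (number of conserved sites, total number of sites).
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("aligned sequences must have equal length")
    sites = set(site_positions)
    conserved = 0
    residue_index = 0  # running 1-based index into the ungapped template
    for t_res, q_res in zip(template_aln, query_aln):
        if t_res == "-":
            continue  # insertion in the query; no template residue in this column
        residue_index += 1
        # Assumption: a site counts as conserved only on exact identity.
        # A query gap ('-') at a site position never matches.
        if residue_index in sites and t_res == q_res:
            conserved += 1
    return conserved, len(sites)


# Toy alignment: template residues 3 (D), 5 (H), and 6 (K) are functional sites.
conserved, total = site_conservation("AC-DEHK", "ACGDQH-", [3, 5, 6])
```

In this toy case, sites 3 and 5 are matched by identical residues while site 6 falls opposite a gap, so two of the three sites are counted as conserved. A less strict criterion, such as accepting substitutions within a residue class, would change only the comparison inside the loop.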
Keywords: fold assignments; function predictions; genome analysis
The prediction of protein folds and functions from sequence is the "Holy Grail" of molecular biology. With improving sequencing methods, the number of known protein sequences has increased over 10-fold in the last two years and is expected to grow even faster in the next several years. The experimental characterization of new proteins is also improving, but at a much slower rate. Consequently, computer analysis of new sequences, particularly aiming at recognition of similarity to the already characterized protein families, has become a primary tool for analysis of new sequences. For instance, most newl...