The increasing volume of genomic data opens new possibilities for the analysis of protein function. We introduce a method for automated selection of residues that determine the functional specificity of proteins with a common general function (the specificity-determining positions [SDP] prediction method). Such residues are assumed to be conserved within groups of orthologs (which may be assumed to have the same specificity) and to vary between paralogs. Thus, given a multiple sequence alignment of a protein family divided into orthologous groups, one can select positions where the distribution of amino acids correlates with this division. Unlike previously published techniques, the introduced method directly takes into account the nonuniformity of amino acid substitution frequencies. In addition, it does not require setting arbitrary thresholds. Instead, a formal procedure for threshold selection using the Bernoulli estimator is implemented. We tested the SDP prediction method on the LacI family of bacterial transcription factors and on a sample of bacterial water and glycerol transporters belonging to the major intrinsic protein (MIP) family. In both cases, comparison with available experimental and structural data strongly supported our predictions.

Keywords: Orthologs; specificity; prediction; mutual information; substitution matrix; cutoff

The exponential growth of genomic data far exceeds the capacity of experimental analysis of protein function. On the other hand, intelligent use of genomic data may save experimentalists' effort. A standard technique for functional annotation of proteins is the database similarity search. However, in many cases it allows one to assign only a general function to a protein of interest (e.g., "transcriptional regulator of the LacI family"), but cannot resolve the protein's specificity (say, "purine or ribose repressor").
More detailed genomic analysis, using identification of orthologs, positional genomic analysis, metabolic reconstruction, analysis of regulation, and other comparative techniques, strongly improves the resolution of prediction (Koonin and Galperin 2003). In many cases, the comparative techniques allow one to tentatively assign a common (often unknown) specificity to groups of proteins, and thus provide data for the analysis of specificity-determining residues in protein sequences. An overview of some of these methods is given in Hannenhalli and Russell (2000). Some of them, in particular the evolutionary trace analysis (Lichtarge et al. 1996, 1997) and the structure-based approach to prediction of protein function (Johnson and Church 2000), rely heavily on known protein structure or on information about protein functional sites. However, in many cases the structural data are not available, and there are methods that use purely genomic data in the form of aligned protein

Reprint requests to: Mikhail S. Gelfand, State Scientific Center GosNIIGenetika, 1st Dorozhny pr., 1, Moscow 113545, Russia; e-mail: gelfand@ig-msk.ru; fax: 7-095-...