As experimental technologies for characterization of proteomes emerge, bioinformatic analysis of the data becomes essential. Separation and identification technologies currently based on two-dimensional gels/mass spectrometry provide the inherent analytical power required. This strategy involves protein spot digestion and accurate mass mapping together with computational interrogation of available data bases for protein functional identification. When either no exact match is found or the possible matches only partially account for the molecular weights actually observed, peptide sequencing by tandem mass spectrometry has emerged as the methodology of choice to provide the additional information required. To evaluate the capabilities of bioinformatics methods employed for identifying homologs of a protein of interest, we attempted to identify the major proteins from the 20 S proteasome of Trypanosoma brucei using sequence information determined by mass spectrometry. The results suggest that neither the traditional query engines, BLAST and FASTA, nor specialized software developed for analysis of sequence information obtained by mass spectrometry is able to identify even closely related sequences at statistically significant scores. To address this deficit, new bioinformatics approaches were developed for concomitant use of the multiple fragments of short sequence typically available from methods of tandem mass spectrometry. These approaches rely on the occurrence of congruence across searches of multiple fragments from a single protein. This method resulted in sharply better statistical significance values for correct hits in the data base output relative to those achieved for independent searches using single sequence fragments.

Fueled by the genome projects, encyclopedic increases in the banking of newly obtained, comprehensive biological data are transforming studies of biology and medicine (1).
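The notion of congruence across searches of multiple fragments, as described in the abstract above, can be illustrated with a minimal sketch. It is not the scoring scheme developed in this work: the function name `rank_by_congruence`, the input layout, the `min_fragments` threshold, and the example scores are all hypothetical, and the sketch assumes that a higher score means a better match. It simply pools the hit lists from independent single-fragment searches and ranks each data base entry by how many distinct fragments point to it.

```python
from collections import defaultdict


def rank_by_congruence(hits_per_fragment, min_fragments=2):
    """Rank data base entries by how many distinct query fragments hit them.

    hits_per_fragment maps a fragment label to a list of (entry_id, score)
    pairs. Higher score = better match (an assumption of this sketch;
    E-values would have to be transformed before being used here).
    """
    # entry_id -> {fragment label: best score seen for that fragment}
    best = defaultdict(dict)
    for fragment, hits in hits_per_fragment.items():
        for entry_id, score in hits:
            prev = best[entry_id].get(fragment)
            if prev is None or score > prev:
                best[entry_id][fragment] = score

    ranked = []
    for entry_id, per_fragment in best.items():
        # Keep only entries supported by enough independent fragments.
        if len(per_fragment) >= min_fragments:
            ranked.append(
                (entry_id, len(per_fragment), sum(per_fragment.values()))
            )

    # Most supporting fragments first, then highest summed score.
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked


# Made-up search output for three fragments from one protein.
example_hits = {
    "frag1": [("entryA", 31.0), ("entryB", 33.0)],
    "frag2": [("entryA", 28.5), ("entryC", 30.0)],
    "frag3": [("entryA", 27.0), ("entryB", 25.0)],
}
ranking = rank_by_congruence(example_hits)
```

In this invented example no single search places `entryA` far ahead of the other entries, yet it is the only entry hit by all three fragments and therefore ranks first once the searches are considered together. A real implementation would also need a significance estimate for the combined evidence, which this sketch does not attempt.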
As the postgenomic era moves into high gear, new "high throughput" technologies are allowing characterization of gene expression profiles, comparisons of genomic complements, and identification of the genetic markers associated with normal, pathological, or environmentally triggered states. Yet information derived from full analysis of genomics alone is clearly inadequate to explain the complexities of cell biology. Recent studies showing differences between the genome and the proteome suggest that the profound understanding we seek will require the complete and direct characterization of the proteome as well (2, 3).

Peptide mass mapping by MALDI-TOF MS (4) or liquid chromatography-electrospray ionization MS (5, 6), combined with interrogation of sequence data bases (7-12), currently is the most widely employed strategy for the identification of expressed proteins. This methodology involves electrophoretic separation of proteins at sub-picomole levels, digestion with trypsin, and measurement of the molecular weights of the resulting peptide mixture by mass spectrometry. This strategy can routinely identify p...