During the past two decades, mass spectrometry has become established as the primary method for protein identification from complex mixtures of biological origin. This is largely attributable to the fortunate coincidence of instrumental advances that allow routine analysis of minute amounts (typically femtomoles) of involatile, polar compounds such as peptides in complex mixtures, with the rapid growth in genomic databases that are amenable to searching with mass spectrometry (MS) data. As in many other developing fields in science, the creation of techniques and software tools and the initial generation and interpretation of data have been the domain of experts, people who are cognizant not only of the benefits of the methods but also of their actual and potential weaknesses. Now, as mass spectrometric techniques and proteomic tools become increasingly available and accessible, a much broader range of researchers is applying the same methodology, often with substantially less understanding of the major limitations that critically affect the reliability and significance of the results. Ideally, the MS community should establish criteria for mass spectrometric identification of proteins that should be employed by researchers. As this remains a rapidly developing field with many different experimental approaches and different ways of searching and interpreting the data, it is difficult to promulgate hard and fast rules. Nevertheless, Molecular & Cellular Proteomics is attempting to develop standards of acceptability for proteomics papers, based on emerging knowledge as well as on principles of biological MS established over the last 20 years by the MS community. Authors of proteomics papers employing MS must make themselves fully aware of the key issues that are driving development of these guidelines. Hence, the paper that follows attempts to highlight the strengths and weaknesses of the methods in current use.
It is particularly important to realize that for any protein match returned from a database search, there is a non-zero probability that it will be wrong. Many times, the quality of the data is such that the probability of a false positive can be disregarded, but in some cases the identifications returned by the search engines are very likely incorrect. Therefore, it is unacceptable to simply list all the hits that come back from any search engine and then discuss their biological significance as though they were categorically correct.
PEPTIDE ANALYSIS
Almost without exception, protein identification is based on the analysis of peptides generated by proteolytic digestion. The most widely used enzyme is trypsin, which hydrolyzes the protein on the C-terminal side of lysine and arginine, unless the subsequent amino acid in the sequence is a proline. This is advantageous as every peptide other than the one from the protein C terminus has at least two sites for efficient protonation, the N-terminal amino group and the C-terminal basic residue, so peptides are readily ionized and detected as positive ions. However, for a v...