Database searching is an essential element of large-scale proteomics. Because these methods are widely used, it is important to understand the rationale of the algorithms. Most algorithms are based on concepts first developed in SEQUEST and PeptideSearch. Four basic approaches are used to determine a match between a spectrum and sequence: descriptive, interpretative, stochastic and probability-based matching. We review the basic concepts used by most search algorithms, the computational modeling of peptide identification and current challenges and limitations of this approach for protein identification.An unintended consequence of whole-genome sequencing has been the birth of large-scale proteomics. What drives proteomics is the ability to use mass spectrometry data of peptides as an 'address' or 'zip code' to locate proteins in sequence databases. Two mass spectrometry methods are used to identify proteins by database search methods. The first method uses a molecular weight fingerprint measured from a protein digested with a site-specific protease [1][2][3][4][5] . A second method uses tandem mass spectra derived from individual peptides of a digested protein 6,7 (Fig. 1). Because each tandem mass spectrum represents an independent and verifiable piece of data, this approach to database searching has the ability to identify proteins in mixtures, enabling a rapid and comprehensive approach for the analysis of protein complexes and other complicated mixtures of proteins 6,[8][9][10][11][12] . New biology has been discovered based on fast and accurate protein identification [13][14][15][16][17][18] . As tandem mass spectral protein identification has proliferated, it has become increasingly important to understand the rationale of individual database search algorithms, their relative strengths and weaknesses, and the mathematics used to match sequence to spectrum.In this review we discuss the prevailing fragmentation models, spectral preprocessing, methods to match tandem mass spectra to sequences and several approaches to matching tandem mass spectra of peptides whose exact sequences may not be present in the database. Space limitations restrict a detailed description of all algorithms in this rapidly expanding field. Also, some algorithms are proprietary, and thus, details on how they work are unknown. This review should supplement and update earlier reviews on database search algorithms [19][20][21][22][23][24] .
Peptide fragmentation and data preprocessingIn tandem mass spectrometry (MS/MS), gas phase peptide ions undergo collision-induced dissociation (CID) with molecules of an inert gas such as helium or argon 25 . Other methods of dissociation have been developed, such as electron capture dissociation (ECD), surface induced dissociation (SID) and electron transfer dissociation (ETD), but gas-phase CID is the most widely used in commercial tandem mass spectrometers. The dissociation pathways are strongly dependent on the collision energy, but the vast majority of instruments use low-energy CID (<100 eV) 26 ....