The combined method of LC-MS/MS is increasingly being used to explore differences in the proteomic composition of complex biological systems. The reliability and utility of such comparative protein expression profiling studies depend critically on an accurate and rigorous assessment of quantitative changes in the relative abundance of the myriad proteins typically present in a biological sample such as blood or tissue. In this review, we provide an overview of key statistical and computational issues relevant to bottom-up shotgun global proteomic analysis, with an emphasis on methods that can be applied to improve the dependability of biological inferences drawn from large proteomic datasets. Focusing on a start-to-finish approach, we address the following topics: 1) low-level data processing steps, such as formation of a data matrix, filtering, and baseline subtraction to minimize noise; 2) mid-level processing steps, such as data normalization, alignment in time, peak detection, peak quantification, peak matching, and error models, to facilitate profile comparisons; and 3) high-level processing steps, such as sample classification and biomarker discovery, and related topics such as significance testing, multiple testing, and choice of feature space. We report on approaches that have recently been developed for these steps, discuss their merits and limitations, and propose areas deserving of further research.
Molecular & Cellular Proteomics 4:419-434, 2005.
With the sequencing of the human genome largely complete and publicly available, emphasis in molecular biology is shifting away from DNA sequencing and related problems toward a systematic evaluation of how the myriad of encoded gene products operate together to mediate the biological mechanisms that sustain life, and how these processes become perturbed in response to disease. Comprehensive systems-wide biological studies have been greatly facilitated by the advent of large-scale genomic, proteomic, and informatic technologies, such as DNA microarrays, ultra-sensitive high-throughput protein MS, and robust statistical and machine-learning methods developed for very large datasets. Evaluation, interpretation, and integration of data produced by these respective platforms represent major ongoing challenges and areas of active research.
The field of expression proteomics seeks to answer the following questions: 1) which proteins and variant isoforms are expressed during the lifecycle of an organism; 2) which post-translational modifications occur in each of these proteins; 3) how do these patterns differ in different cell types and tissues and under different developmental, physiological, and disease conditions; and 4) how can biologists make use of this information to better understand the molecular basis for fundamental biological processes as well as for monitoring the course of disease so as to improve clinical diagnosis and treatment (1-3). These questions are made all the more difficult by the complexity of most biological systems, which increases...