As the capability of mass spectrometry-based proteomics has matured, tens of thousands of peptides can be measured simultaneously, which has the benefit of offering a systems view of protein expression. However, a major challenge is that, with this increase in throughput, estimating protein quantities from the natively measured peptides has become a computational task. A limitation of most existing computationally driven protein quantification methods is that they ignore protein variation, such as alternative splicing of the RNA transcript, post-translational modifications, and other possible proteoforms, which affect a significant fraction of the proteome. The consequence of this assumption is that statistical inference at the protein level, and consequently any downstream analysis, can be compromised.

The application of MS-based proteomics has resulted in large-scale studies in which the set of measured, and subsequently identified, peptides is often used to estimate protein abundance. In particular, label-free MS-based proteomics is highly effective for identifying peptides and measuring relative peptide abundances (1, 2), but it does not directly yield protein quantities. The importance of accurate protein quantification cannot be overstated; it is the essential component of identifying biomarkers of disease and of defining the relationships among gene regulation, protein interactions, and signaling networks in a cellular system (3, 4). The major challenge is that protein abundance depends not only on the transcription rate of the gene but also on additional control mechanisms, such as mRNA stability, translational regulation, and protein degradation. Moreover, the functional activity of proteins can be altered through a variety of post-translational modifications, proteolytic processing, and alternative splicing, events that selectively alter the abundance of some peptides while leaving others unchanged (4). This complexity of the proteome, in addition to issues associated with the measurement and identification of peptides, makes accurate protein quantification a difficult computational problem.
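To make the rollup step concrete, here is a minimal sketch of a common baseline, not the method of any work described here; the protein identifiers and intensity values are made up. Each protein's abundance is estimated as the median of its peptides' log2 intensities, and this is precisely the kind of rollup that ignores proteoform-level variation:

```python
from collections import defaultdict
from statistics import median

# Hypothetical peptide-level data: (protein_id, log2 peptide intensity).
peptides = [
    ("P1", 20.1), ("P1", 19.8), ("P1", 21.0),
    ("P2", 15.2), ("P2", 15.9),
]

def median_rollup(measurements):
    """Estimate each protein's abundance as the median of its peptides'
    log2 intensities, a simple baseline that, as noted in the text,
    ignores splice variants, modifications, and other proteoforms."""
    by_protein = defaultdict(list)
    for protein_id, intensity in measurements:
        by_protein[protein_id].append(intensity)
    return {p: median(v) for p, v in by_protein.items()}

print(median_rollup(peptides))  # {'P1': 20.1, 'P2': 15.55}
```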
Ensuring data quality and proper instrument functionality is a prerequisite for scientific investigation. Manual quality assurance is time-consuming and subjective. Metrics for describing liquid chromatography–mass spectrometry (LC–MS) data have been developed; however, the wide variety of LC–MS instruments and configurations precludes applying a simple cutoff. Using 1150 manually classified quality control (QC) data sets, we trained logistic regression classification models to predict whether a data set is in or out of control. Model parameters were optimized by minimizing a loss function that accounts for the trade-off between false positive and false negative errors.
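As a sketch of the kind of procedure described above, assuming hypothetical QC metrics, synthetic labels, and an assumed 5:1 false-negative to false-positive cost ratio (the paper's actual metrics and loss function are not reproduced here), one can fit a logistic regression to curated in/out-of-control labels and pick the decision threshold that minimizes the weighted misclassification loss:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for curated QC metrics (e.g., peptide count, mass error,
# chromatographic peak width); y = 1 marks an out-of-control data set.
X = rng.normal(size=(1150, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1150) > 0.8).astype(int)

model = LogisticRegression().fit(X, y)
p_bad = model.predict_proba(X)[:, 1]  # probability a run is out of control

def weighted_loss(threshold, p, y, c_fp=1.0, c_fn=5.0):
    """Misclassification loss trading false positives against false
    negatives; the 5:1 cost ratio is an assumed tuning knob, not a
    value taken from the paper."""
    pred = p >= threshold
    fp = np.sum(pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    return c_fp * fp + c_fn * fn

thresholds = np.linspace(0.05, 0.95, 91)
best = min(thresholds, key=lambda t: weighted_loss(t, p_bad, y))
print(f"threshold minimizing the weighted loss: {best:.2f}")
```

In practice the threshold would be selected on held-out data rather than the training set, consistent with the separate validation set described next.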
The classifier models detected bad data sets with high sensitivity while maintaining high specificity. Moreover, the composite classifier was dramatically more specific than any single metric. Finally, we evaluated the performance of the classifier on a separate validation set, where it performed comparably to the results for the testing/training data sets (the calculation behind such an evaluation is sketched below). By presenting the methods and software used to create the classifier, we enable other groups to build classifiers tailored to their own QC regimens, which vary considerably from lab to lab. In total, this manuscript presents 3400 LC–MS data sets for the same QC sample (whole cell lysate of Shewanella oneidensis), deposited to ProteomeXchange with identifiers PXD000320–PXD000324.
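The sensitivity and specificity figures referenced above reduce to simple confusion-matrix ratios. A minimal sketch, using made-up validation labels rather than any values from the paper:

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN) for the out-of-control class;
    specificity = TN / (TN + FP) for the in-control class."""
    t = np.asarray(y_true, dtype=bool)
    p = np.asarray(y_pred, dtype=bool)
    tp, fn = np.sum(p & t), np.sum(~p & t)
    tn, fp = np.sum(~p & ~t), np.sum(p & ~t)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up validation labels and predictions, for illustration only.
sens, spec = sensitivity_specificity([1, 1, 0, 0, 0, 1], [1, 0, 0, 0, 0, 1])
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.67, 1.00
```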