quantifying small fold-changes. This is crucial to detect subtle, system-wide effects of a perturbation on the protein network. Second, HD refers to the number of observations (pixels) available for each protein. As more perturbations are analysed, regulatory patterns become more refined and can be detected more accurately.

To assemble ProteomeHD we processed the raw data from 5,288 individual mass-spectrometry runs into one coherent data matrix, which covers 10,323 proteins (from 9,987 genes) and 294 biological conditions (Supplementary Table 1). About 20% of the experiments were performed in our laboratory and the remaining data were collected from the Proteomics Identifications (PRIDE) 46 repository (Fig. 1a). The data cover a wide array of quantitative proteomics experiments, such as perturbations with drugs and growth factors, genetic perturbations, cell differentiation studies and comparisons of cancer cell lines (Supplementary Table 2). All experiments are comparative studies using SILAC 45, i.e. they do not report absolute protein concentrations but highly accurate fold-changes in response to perturbation. About 60% of the included experiments analysed whole-cell samples. The remaining measurements were performed on samples that had been fractionated after perturbation, e.g. to enrich for chromatin-based or secreted proteins. This allows for the detection of low-abundance proteins that may not be detected in whole-cell lysates.
ProteomeHD offers high protein coverage

On average, the 10,323 human proteins in ProteomeHD were quantified on the basis of 28.4 peptides and a sequence coverage of 49% (Supplementary Fig. 1). As expected from shotgun proteomics data, not every protein is quantified in every condition. The 294 input experiments quantify 3,928 proteins on average. Each protein is quantified, on average, in 112 biological conditions (Supplementary Fig. 1). As a rule of thumb, coexpression studies discard transcripts detected in fewer than half of the samples. However, with 294 conditions ProteomeHD is considerably larger than the typical coexpression analysis. We therefore decided to use a lower, arbitrary cut-off and include proteins for downstream analysis if they were quantified in about a third of the conditions. Specifically, we focus our co-regulation analysis on the 5,013 proteins that were quantified in at least 95 of the 294 perturbation experiments. On average, these 5,013 proteins were quantified in 190 conditions; 43% were quantified in more than 200 conditions (Supplementary Fig. 1).
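
The coverage cut-off described above can be illustrated with a minimal sketch. It assumes the data matrix is held as a NumPy array of SILAC ratios with proteins in rows, conditions in columns, and missing quantifications encoded as NaN. The function name `filter_by_coverage`, the variable names and the simulated toy data are hypothetical and serve only as illustration; they are not taken from the ProteomeHD pipeline.

```python
import numpy as np


def filter_by_coverage(ratios, min_conditions=95):
    """Keep proteins (rows) quantified in at least `min_conditions` conditions.

    `ratios` is a 2-D array (proteins x conditions) in which NaN marks a
    protein that was not quantified in a given condition (an assumption
    of this sketch).
    """
    # Number of conditions with a non-missing value for each protein
    counts = np.sum(~np.isnan(ratios), axis=1)
    keep = counts >= min_conditions
    return ratios[keep], keep, counts


# Toy data: 200 simulated proteins across 294 conditions
rng = np.random.default_rng(0)
n_proteins, n_conditions = 200, 294
ratios = rng.normal(0.0, 0.5, size=(n_proteins, n_conditions))

# Give each simulated protein its own fraction of missing values
missing_rate = rng.uniform(0.2, 0.95, size=n_proteins)
missing = rng.random((n_proteins, n_conditions)) < missing_rate[:, None]
ratios[missing] = np.nan

filtered, keep, counts = filter_by_coverage(ratios, min_conditions=95)
print(f"{keep.sum()} of {n_proteins} simulated proteins pass the cut-off")
```

With 294 conditions, a threshold of 95 corresponds to roughly 32% of the conditions, which matches the "about a third" criterion stated in the text.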
Machine-learning captures functional protein associations

Proteins that function together have similar patterns of up- and down-regulation across the many conditions and samples in ProteomeHD. For example, the patterns of proteins belonging to two well-known biological processes, oxidative phosphorylation and rRNA processing, can be clearly distinguished, even though most expression changes are well below 2-fold (Fig. 1b). Therefore, we reasoned that it should be possible to reveal functional links be...