Advances in proteomics technologies have enabled novel protein interactions to be detected at high speed, but at the cost of relatively low data quality. A crucial step in using high throughput protein interaction data is therefore to evaluate their confidence and then separate the subsets of reliable interactions from the background noise for further analysis. Using Bayesian network approaches, we combine multiple heterogeneous types of biological evidence, including model organism protein-protein interactions, interaction domains, functional annotation, gene expression, genome context, and network topology structure, to assign reliability to the human protein-protein interactions identified by high throughput experiments. The method shows high sensitivity and specificity in predicting true interactions from human high throughput protein-protein interaction data sets. It has been developed into an on-line confidence scoring system specifically for human high throughput protein-protein interactions. Users may submit their protein-protein interaction data on line, and the confidence scores are returned together with detailed information about the evidence supporting each query interaction. The Web interface of PRINCESS (protein interaction confidence evaluation system with multiple data sources) is available at the website of China Human Proteome Organisation.
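As a rough illustration of how several independent lines of evidence can be merged into one confidence score, the sketch below uses a naive Bayes combination of likelihood ratios. This is a simplifying assumption made for illustration only: the structure of the Bayesian network actually used is not given in this section, and the function name, evidence names, prior odds, and likelihood ratios are all hypothetical.

```python
from math import prod


def posterior_probability(prior_odds, likelihood_ratios):
    """Combine evidence under a naive Bayes (conditional independence) assumption.

    prior_odds: odds that a random candidate pair truly interacts (hypothetical value).
    likelihood_ratios: one ratio per evidence type, P(evidence | true) / P(evidence | false).
    """
    # Posterior odds are the prior odds multiplied by every likelihood ratio.
    posterior_odds = prior_odds * prod(likelihood_ratios)
    # Convert odds to a probability in [0, 1].
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical likelihood ratios for one candidate interaction (not from the paper).
example_lrs = {
    "model_organism_interaction": 8.0,
    "interaction_domain": 4.0,
    "gene_coexpression": 0.5,
}

score = posterior_probability(prior_odds=0.25, likelihood_ratios=example_lrs.values())
print(f"confidence score: {score:.2f}")
```

A ratio above 1 raises the confidence and a ratio below 1 lowers it, so weak or contradictory evidence pulls the score down rather than being ignored.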
Molecular & Cellular Proteomics 7:1043-1052, 2008.
Protein-protein interactions play important roles in defining most cellular functions (1-2). Traditionally, protein interactions have been studied individually by top-down, hypothesis-driven approaches, with experiments designed to derive high quality, detailed interaction information. Recent advances in proteomics technologies have enabled a large number of novel protein interactions to be detected at an unprecedented speed by yeast two-hybrid screens (3-8) and tandem affinity purification (9, 10). Compared with the traditional approaches, however, high throughput approaches typically yield more error-prone data sets. For example, von Mering et al. (11) estimated that approximately half of the interactions obtained from high throughput experiments might be false positives. These false positives may connect unrelated proteins, complicating and even misleading the elucidation of biological significance (12, 13). Therefore, a crucial step in analyzing a high throughput protein interaction data set (HTPID) is evaluating the reliability of the interactions and then separating the subset of credible interactions from background noise.
Several methods have been developed previously to predict the true protein interactions from high throughput protein interaction data sets, such as data set intersection (11, 14), homologous interaction (15, 16), interacting domains (17), functional similarity (3, 18), gene coexpression (19, 20), and protein interaction network topology (21-23). Most of these methods are based on a single type of biological evidence. Although these "Single Evidence Models" have b...