Abstract. Protein-protein interactions are critical to many biological processes, ranging from the formation of cellular macromolecular structures and enzymatic complexes to the regulation of signal transduction pathways. With the availability of complete genome sequences, several groups have begun large-scale identification and characterization of such interactions, relying mostly on high-throughput two-hybrid systems. We collaborate with one such group, led by Marc Vidal, whose aim is the construction of a protein-protein interaction map for C. elegans. In this paper we first describe WISTdb, a database designed to store the interaction data generated in Marc Vidal's laboratory. We then describe InterDB, a multi-organism prediction-oriented database of protein-protein interactions. Finally, we discuss our current approaches, based on inductive logic programming and on a data mining technique, for extracting predictive rules from the collected data.
The Biological Problem: Protein-Protein Interactions

Protein-protein interactions are critical to many biological processes, ranging from the formation of cellular macromolecular structures and enzymatic complexes to the regulation of signal transduction pathways. With the availability of complete genome sequences, several groups have begun large-scale identification and characterization of such interactions [11], [22], [25]. These groups rely mostly on high-throughput two-hybrid systems [23]. Although such approaches significantly increase the rate at which interaction data is produced, they will require several years to produce full interaction maps for modest-sized organisms, whereas the "working draft" of the human genome has been available since June 2000. It is therefore enticing and promising to develop computational methods that could predict protein-protein interactions, even if only in a rough and approximate manner. Ideally, the data produced by those high-throughput projects could suffice to develop prediction algorithms that could then be applied to genome sequences as fast as they are released. More realistically, the high-throughput projects themselves could benefit from predictions to speed up the discovery of interesting protein-protein interactions (see part 4).

In this paper we concentrate on a preliminary step to study protein-protein interactions in Caenorhabditis elegans. C. elegans is the first multi-cellular organism whose genome has been completely sequenced [5], as well as being a choice