A prerequisite for systems biology is the integration and analysis of heterogeneous experimental data stored in hundreds of life-science databases and millions of scientific publications. Several standardised formats for the exchange of specific kinds of biological information exist. Such exchange languages facilitate the integration process; however, they are not designed to transport integrated datasets. A format for exchanging integrated datasets needs to i) cover data from a broad range of application domains, ii) be flexible and extensible to combine many different complex data structures, iii) include metadata and semantic definitions, iv) include inferred information, v) identify the original data source for integrated entities and vi) transport large integrated datasets. Unfortunately, none of the exchange formats from the biological domain (e.g. BioPAX, MAGE-ML, PSI-MI, SBML) or the generic approaches (RDF, OWL) fulfil these requirements in a systematic way. We present OXL, a format for the exchange of integrated datasets, and detail how the aforementioned requirements are met within the OXL format. OXL is the native format of the data integration and text mining system ONDEX. Although OXL was developed with the ONDEX system in mind, it also has the potential to be used in several other biological and non-biological applications described in this paper.

Availability: The OXL format is an integral part of the ONDEX system, which is freely available under the GPL at http://ondex.sourceforge.net/. Sample files can be found at http://prdownloads.sourceforge.net/ondex/ and the XML Schema at http://ondex.svn.sf.net/viewvc/*checkout*/ondex/trunk/backend/data/xml/ondex.xsd.
Introduction

The importance of database integration for all life sciences is generally recognised. In the pharmaceutical industry [1; 2] in particular, data integration is a crucial technology for the drug discovery process [3; 4]. Although many different approaches to data integration exist and have been reviewed in [5], there are no standard formats specifically designed for the exchange of integrated datasets. Thus, users of database integration systems have to rely on the proprietary interfaces and exchange formats of the different data integration platforms. Although the use of XML, RDF and OWL simplifies the exchange of integrated datasets, none of the existing XML-, RDF- and OWL-based formats is suitable as a generic format for the exchange of integrated datasets.

Data integration has to deal with a broad range of heterogeneous data sources. Traditionally, databases were distributed using proprietary flatfile formats or tab-delimited database dumps. It is a popular myth that the appearance of XML has made these formats obsolete; however, our experience shows that still only about 5% of all databases provide an XML-based format [6]. Several databases are still exclusively distributed in proprietary flatfile formats or as database dumps. A high percentage of these databases provide no ...