Abstract. Publishing interlinked RDF datasets as links between data items identified using dereferenceable URIs on the web brings forward a number of issues. A key challenge is to understand the data, the schema, and the interlinks that are actually used both within and across linked datasets. Understanding actual RDF usage is critical in the increasingly common situations where terms from different vocabularies are mixed. In this paper we describe a tool, ExpLOD, that supports exploring summaries of RDF usage and interlinking among datasets from the Linked Open Data cloud. ExpLOD's summaries are based on a novel mechanism that combines text labels and bisimulation contractions. The labels assigned to RDF graphs are hierarchical, enabling summarization at different granularities. The bisimulation contractions are applied to subgraphs defined via queries, providing for summarization of arbitrarily large or small graph neighbourhoods. Also, ExpLOD can generate SPARQL queries from a summary. Experimental results, using several collections from the Linked Open Data cloud, compare the two summary creation approaches implemented by ExpLOD (graph-based vs. SPARQL-based).
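As a rough illustration of the idea behind a bisimulation contraction (and not ExpLOD's actual implementation), the following Python sketch partitions the nodes of a small labeled graph so that two nodes share a block only if they carry the same label and reach the same blocks over the same predicates. The function name `bisimulation_partition`, the input format, and the example labels and predicates are assumptions made for this sketch; the hierarchical labels and query-defined neighbourhoods described in the abstract are not modeled.

```python
from collections import defaultdict


def bisimulation_partition(labels, edges):
    """Return a dict node -> block id for a forward bisimulation partition.

    labels: dict mapping each node to a text label (hypothetical input format).
    edges:  iterable of (source, predicate, target) triples over those nodes.
    """
    outgoing = defaultdict(list)
    for src, pred, dst in edges:
        outgoing[src].append((pred, dst))

    # Start with one block per distinct label.
    block = dict(labels)
    num_blocks = len(set(block.values()))

    while True:
        # A node's signature is its current block plus the set of
        # (predicate, target block) pairs it can reach in one step.
        ids = {}
        new_block = {}
        for node in labels:
            sig = (block[node],
                   frozenset((p, block[d]) for p, d in outgoing[node]))
            new_block[node] = ids.setdefault(sig, len(ids))
        block = new_block
        # Each round only splits blocks, so an unchanged block count
        # means the partition has reached its fixpoint.
        if len(ids) == num_blocks:
            return block
        num_blocks = len(ids)


if __name__ == "__main__":
    labels = {"a1": "Person", "a2": "Person", "a3": "Person",
              "d1": "Doc", "d2": "Doc"}
    edges = [("a1", "wrote", "d1"), ("a2", "wrote", "d2")]
    print(bisimulation_partition(labels, edges))
```

In this toy input, the two authors that each wrote a document end up in one block, while the person with no outgoing edges is placed in a separate block; each block would correspond to one node of the summary.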
Bisimulation summaries of graph data have multiple applications, including facilitating graph exploration and enabling query optimization techniques, but efficient, scalable summary construction is challenging. The literature describes parallel construction algorithms using message-passing, and these have been recently adapted to MapReduce environments. The fixpoint nature of bisimulation is well suited to iterative graph processing, but the existing MapReduce solutions do not drastically decrease per-iteration times as the computation progresses. In this paper, we focus on leveraging parallel multi-core graph frameworks with the goal of constructing summaries in roughly the same amount of time that it takes to input the data into the framework (for a range of real-world data graphs) and output the summary. To achieve our goal, we introduce a singleton optimization that significantly reduces per-iteration times after only a few iterations. We present experimental results validating that our scalable GraphChi implementation achieves our goal with bisimulation summaries of million- to billion-edge graphs.
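The abstract does not spell out the singleton optimization, so the sketch below shows only one plausible reading of it: a block that contains a single node can never be split again, so that node can be skipped in later refinement rounds. This is a sequential Python illustration, not the GraphChi implementation; the function name `bisimulation_with_singleton_skip`, the input format, and the chain-shaped example graph are assumptions.

```python
from collections import Counter, defaultdict
from itertools import count


def bisimulation_with_singleton_skip(labels, edges):
    """Return (node -> block id, signatures computed per iteration).

    labels: dict mapping each node to a label (hypothetical input format).
    edges:  iterable of (source, predicate, target) triples over those nodes.
    """
    outgoing = defaultdict(list)
    for src, pred, dst in edges:
        outgoing[src].append((pred, dst))

    fresh = count()  # globally unique block ids, never reused
    by_label = {}
    block = {}
    for node, label in labels.items():
        if label not in by_label:
            by_label[label] = next(fresh)
        block[node] = by_label[label]

    work_per_iteration = []
    while True:
        sizes = Counter(block.values())
        # Assumed singleton rule: nodes alone in their block are skipped.
        active = [n for n in labels if sizes[block[n]] > 1]
        work_per_iteration.append(len(active))

        groups = {}
        for node in active:
            sig = (block[node],
                   frozenset((p, block[d]) for p, d in outgoing[node]))
            groups.setdefault(sig, []).append(node)

        # Skipped nodes keep their ids; each signature group gets a new id.
        new_block = dict(block)
        for members in groups.values():
            block_id = next(fresh)
            for node in members:
                new_block[node] = block_id

        stable = len(set(new_block.values())) == len(sizes)
        block = new_block
        if stable:
            return block, work_per_iteration


if __name__ == "__main__":
    labels = {"n0": "N", "n1": "N", "n2": "N", "n3": "N"}
    edges = [("n0", "next", "n1"), ("n1", "next", "n2"), ("n2", "next", "n3")]
    blocks, work = bisimulation_with_singleton_skip(labels, edges)
    print(blocks, work)
```

On the four-node chain, the number of signatures recomputed per round shrinks as nodes become singletons, which is the kind of per-iteration saving the abstract attributes to the optimization.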
Summary. PSI-MI has been endorsed by the protein informatics community as a standard XML data exchange format for protein-protein interaction datasets. While many public databases support the standard, there is a degree of heterogeneity in the way the proposed XML schema is interpreted and instantiated by different data providers. Analysis of schema instantiation in large collections of XML data is a challenging task that is unsupported by existing tools. In this study we use DescribeX, a novel visualization technique for (semi-)structured XML formats, to quantitatively and qualitatively analyze PSI-MI XML collections at the instance level, with the goal of gaining insights into schema usage and studying specific questions such as the adequacy of controlled vocabularies, the detection of common instance patterns, and the evolution of different data collections. Our analysis shows that DescribeX enhances understanding of the instance-level structure of PSI-MI data sources and is a useful tool for standards designers, software developers, and PSI-MI data providers.
Abstract. DescribeX is a visual, interactive tool for exploring the underlying structure of an XML collection. DescribeX implements a framework for creating XML summaries described using axis path regular expressions (abbreviated AxPRE). AxPREs capture all the bisimilarity-based proposals in the summary literature, and they can be used to define new and more expressive summaries. This demonstration shows how DescribeX helps to analyze diverse XML collections in one particular scenario: the analysis of protein-protein interaction XML data from multiple providers that conform to the PSI-MI schema.
I. OVERVIEW
XML has been adopted as the standard format for numerous applications in data exchange, web-based feeds (blogs, news feeds, podcasts), hypertext collections, and web services. XML schemas are used across different application domains for validating domain-specific XML instances. Schema validation provides a strong basis from which to structure, author, and interpret XML data. However, even though two XML collections can be validated against a common schema, the actual structure of the XML instances may be quite different in each of the two collections. This situation may occur because the common schema is extended to allow different user communities to combine schemas freely (e.g., RSS extensions like Yahoo! Media), or because document designers restrict themselves to using just a subset of a larger schema (e.g., best practice guidelines of industry standards like those for IXRetail; see footnote 1). In these scenarios, schemas do not provide sufficient information for understanding the structural commonalities of a given collection.
DescribeX is a visual, interactive tool for exploring the underlying structure of an XML collection, capable of handling gigabyte-size datasets. DescribeX is based on a framework (presented in [1] and [2]) for creating XML summaries based on axis path regular expressions (AxPRE, for short).
DescribeX summaries are specified by a partition created using the novel notion of bisimilarity applied to subgraphs described by an AxPRE. The elements in the extent of a given partition (represented by a node in the summary) can be computed by an XPath query that is constructed by DescribeX. By employing different AxPREs to define the summary partition, DescribeX can capture all the bisimilarity-based proposals in the existing literature, and it can also define new and more expressive summaries.
The graph-based visualization employed by DescribeX makes it straightforward to see the different path structures that are present in the collection. The application of local node refinements (i.e., changing an AxPRE at a given summary node to a different, more detailed AxPRE) can reveal detailed substructure variations. DescribeX functionality helps a user quickly understand which parts of the schema are used in practice. Further analysis to find the most common structures and substructures can then be performed in DescribeX through the application of coverage. This provides a strong indicati...
Footnote 1: http://www.nrf-arts.org/