Data summaries for on-demand queries over linked data

Abstract. In recent years the Web of Data has developed into a large compendium of interlinked data sets from multiple domains. Due to the decentralised architecture of this compendium, several of these data sets contain duplicated data. Yet, so far, little attention has been paid to the effect of duplicated data on federated querying. This work presents DAW, a novel duplicate-aware approach to federated querying over the Web of Data. DAW is based on a combination of min-wise independent permutations and compact data summaries. It can be directly combined with existing federated query engines in order to achieve the same query recall values while querying fewer data sources. We extend three well-known federated query processing engines (DARQ, SPLENDID, and FedX) with DAW and compare our extensions with the original approaches. The comparison shows that DAW can greatly reduce the number of queries sent to the endpoints while keeping query recall high. Therefore, it can significantly improve the performance of federated query processing engines. Moreover, DAW provides a source selection mechanism that maximises query recall when query processing is limited to a subset of the sources.
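The abstract above names min-wise independent permutations as one building block. As a rough illustration of that general technique (not DAW's actual implementation), the sketch below estimates the overlap between two sources with MinHash signatures. It assumes each triple is serialised as a plain string; the function names `minhash_signature` and `estimate_jaccard` and the sample triples are hypothetical, and salted BLAKE2b hashes stand in for random permutations.

```python
import hashlib


def minhash_signature(items, num_hashes=64):
    """For each of num_hashes salted hash functions, keep the minimum hash over all items.

    Assumption: items is a non-empty collection of strings (e.g. serialised triples).
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt simulates a different permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical sources sharing two of six distinct triples (true Jaccard similarity is 2/6).
source_a = {"s1 p o1", "s2 p o2", "s3 p o3", "s4 p o4"}
source_b = {"s3 p o3", "s4 p o4", "s5 p o5", "s6 p o6"}
overlap = estimate_jaccard(minhash_signature(source_a), minhash_signature(source_b))
print(f"estimated overlap: {overlap:.2f}")
```

A duplicate-aware engine could use such an estimate to skip a source whose data is largely covered by sources it has already selected. How DAW combines the signatures with its data summaries is not described in the abstract and is not modelled here.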

Section: Methods (mentioning)

confidence: 99%

“…As a result, certain queries can only be answered by retrieving information from several data sources. This type of queries, called federated queries, are becoming increasingly popular within the Web of Data [1,3,8,9,12,14,21,22].…”

Section: Introduction (mentioning)

confidence: 99%

DAW: Duplicate-AWare Federated Query Processing over the Web of Data

Saleem

Ngomo

Parreira

et al. 2013

“…The link traversal strategy assumes that Q contains at least one URI d as "entry point" to G. Starting from triples in T_d, G is then searched for results by following links from d to other sources. Instead of exploring sources at runtime, knowledge about (previously processed) Linked Data sources in the form of statistics has been exploited to determine and rank relevant sources [3,10] at query compilation time. Existing approaches assume a source index, which is basically a map that associates a triple pattern q with sources containing triples that match q.…”

Section: Definition 1 (RDF Triple, RDF Graph) Given a Set of URIs U (mentioning)

confidence: 99%
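The citation statement above describes a source index as a map from a triple pattern q to the sources containing triples that match q. A minimal sketch of that idea follows, assuming triples are `(subject, predicate, object)` tuples and that `None` marks a variable position in a pattern. The class `SourceIndex`, its methods, and the example URIs are hypothetical. The sketch stores exact entries, which would not scale to real Linked Data; the compact, approximate data summaries discussed on this page are not modelled here.

```python
from collections import defaultdict
from itertools import product


class SourceIndex:
    """Exact in-memory map from triple patterns to the sources that can answer them."""

    def __init__(self):
        self._sources = defaultdict(set)

    def add_triple(self, source, triple):
        """Register a triple under all eight patterns that match it (None = variable)."""
        s, p, o = triple
        for pattern in product((s, None), (p, None), (o, None)):
            self._sources[pattern].add(source)

    def relevant_sources(self, pattern):
        """Return the set of sources holding at least one triple matching the pattern."""
        return set(self._sources.get(pattern, ()))


# Hypothetical sources and triples, for illustration only.
index = SourceIndex()
index.add_triple("http://example.org/sourceA", ("ex:alice", "foaf:knows", "ex:bob"))
index.add_triple("http://example.org/sourceB", ("ex:bob", "foaf:name", "Bob"))
index.add_triple("http://example.org/sourceB", ("ex:carol", "foaf:knows", "ex:bob"))
print(sorted(index.relevant_sources((None, "foaf:knows", None))))
```

Registering every generalisation of a triple makes lookup a single dictionary access, at the cost of up to eight entries per triple. That trade-off is one reason a compact summary is attractive in practice.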

“…In this context, researchers have studied the problem of Linked Data query processing [3,5,6,10,11,16]. Processing structured queries over Linked Data can be seen as a special case of federated query processing.…”

Section: Introduction (mentioning)

confidence: 99%

Top-k Linked Data Query Processing

Wagner

Duc

Ladwig

et al. 2012

Self Cite

Abstract. In recent years, top-k query processing has attracted much attention in large-scale scenarios, where computing only the k "best" results is often sufficient. One line of research targets the so-called top-k join problem, where the k best final results are obtained through joining partial results. In this paper, we study the top-k join problem in a Linked Data setting, where partial results are located at different sources and can only be accessed via URI lookups. We show how existing work on top-k join processing can be adapted to the Linked Data setting. Further, we elaborate on strategies for a better estimation of scores of unprocessed join results (to obtain tighter bounds for early termination) and for an aggressive pruning of partial results. Based on experiments on real-world Linked Data, we show that the proposed top-k join processing technique substantially improves runtime performance.

“…Notably, histogram approaches generally suffer from the problem that they grow too large or become an insufficiently accurate digest, especially in the face of very heterogeneous data. [5] introduced QTrees, which may alleviate the problem of histogram size, but which may not solve it.…”

Section: In SPARQL Federation (mentioning)

confidence: 99%

Sharing Statistics for SPARQL Federation Optimization, with Emphasis on Benchmark Quality

Kjernsmo

2012

Abstract. Federation of semantic data on SPARQL endpoints will allow data to remain distributed so that it can be controlled by local curators and swiftly updated. There are considerable performance problems, which the present work proposes to address, mainly by computation and exposure of statistical digests to assist selectivity estimation. For an objective evaluation as well as comparison of engines, benchmarks that exhaustively cover the parameter space are required. We propose an investigation into this problem using statistical experimental planning.

Motivation

Query federation with SPARQL, which is a standardized query language for the Semantic Web, has attracted much attention from industry and academia alike, and four implementations of basic query federation were submitted to the SPARQL 1.1 Working Group as input for the forthcoming work. This feature was supported by many group members, and the Last Call working draft of the proposed standard was published on 17 November 2011.

While the basic feature set of the proposed standard can enable users to create federated queries, it is not of great use as it requires extensive prior knowledge of both the data to be queried and performance characteristics of the involved query engines. Without this knowledge, the overall performance is insufficient for most practical applications.

To aid optimization, SPARQL endpoints should expose details about both data and performance characteristics of the engine itself. The proposed work has two focal points: statistical digests of data for optimizations and benchmarking SPARQL engines.

The focus on SPARQL benchmarking is not only motivated from the perspective of optimization, as I have found the current state of the art in SPARQL benchmarking lacking in its use of statistics. The emphasis in the present paper is on statistics in benchmarking with the purpose of providing a firmer foundation on which assertions about engine performance can be backed with evidence.