Abstract. Federation of semantic data on SPARQL endpoints will allow data to remain distributed so that it can be controlled by local curators and swiftly updated. There are considerable performance problems, which the present work proposes to address, mainly by computing and exposing statistical digests to assist selectivity estimation. For an objective evaluation and comparison of engines, benchmarks that exhaustively cover the parameter space are required. We propose an investigation into this problem using statistical experimental planning.
Motivation

Query federation with SPARQL, the standardized query language for the Semantic Web, has attracted much attention from industry and academia alike, and four implementations of basic query federation were submitted to the SPARQL 1.1 Working Group as input for the forthcoming work¹. This feature was supported by many group members, and the Last Call working draft of the proposed standard was published on 17 November 2011.

While the basic feature set of the proposed standard enables users to create federated queries, it is of limited practical use because it requires extensive prior knowledge of both the data to be queried and the performance characteristics of the query engines involved. Without this knowledge, the overall performance is insufficient for most practical applications.

To aid optimization, SPARQL endpoints should expose details about both the data and the performance characteristics of the engine itself. The proposed work has two focal points: statistical digests of data for optimization, and benchmarking of SPARQL engines.

The focus on SPARQL benchmarking is not motivated by optimization alone, as I have found the current state of the art in SPARQL benchmarking lacking in its use of statistics. The emphasis in the present paper is on statistics in benchmarking, with the purpose of providing a firmer foundation on which assertions about engine performance can be backed with evidence.