Sempala: Interactive SPARQL Query Processing on Hadoop

Schätzle, Alexander; Przyjaciel-Zablocki, Martin; Neu, Antony; Lausen, Georg

doi:10.1007/978-3-319-11964-9_11

Cited by 56 publications

(57 citation statements)

References 17 publications

“…This also applies to the intermediate and final results, which in turn facilitates the compositionality of expressions and provides a simple interoperability with, e.g., Hadoop-based SPARQL engines that can use Parquet as input [19]. We also performed some experiments with other storage formats including RCFile, Avro and SequenceFile.…”

Section: RDF Data Layout (mentioning)

confidence: 97%

TriAL-QL

Przyjaciel-Zablocki, Schätzle, Lausen, 2015

Proceedings of the 18th International Workshop on Web and Databases

Self Cite

Navigational queries are among the most natural query patterns for RDF data, yet most existing RDF query languages, including SPARQL 1.1 and its derivatives, fail to cover all the varieties inherent to its triple-based model. As a consequence, the development of more expressive RDF languages is of general interest. With TriAL* [14], there exists an expressive algebra which subsumes many previous approaches, while adding novel features that are not expressible in most other RDF query languages based on the standard graph model. However, its algebraic notation is inappropriate for practical usage and it is not supported by any existing RDF triple store. In this paper, we propose TriAL-QL, an easy to write and grasp language for TriAL*, preserving its compositional algebraic structure. We present an implementation based on Impala, a massively parallel SQL query engine on Hadoop, using an optimized semi-naive evaluation for the recursive fragments of TriAL*. This way, we support both data-intensive ETL-like workloads and explorative ad-hoc style queries. To demonstrate the scalability and expressiveness of our approach, we conducted experiments on generated social networks with up to 1.8 billion triples and compared different execution strategies to a Hive-based solution.
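
The semi-naive evaluation named in the abstract above can be illustrated with a small sketch. It computes the transitive closure of an edge relation and, in each round, joins only the facts derived in the previous round (the delta) with the base edges instead of re-joining everything. This is a generic in-memory illustration, not the TriAL-QL implementation, which the abstract says runs on Impala; the function name `transitive_closure` and the sample edges are hypothetical.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Semi-naive evaluation of: reach(x, z) :- reach(x, y), edge(y, z)."""
    # Index the base edges by source node for fast lookup during the join.
    succ = defaultdict(set)
    for s, o in edges:
        succ[s].add(o)

    reach = set(edges)  # all facts derived so far
    delta = set(edges)  # facts that were new in the previous round
    while delta:
        # Join only the delta with the base edges (the semi-naive step).
        derived = {(x, z) for (x, y) in delta for z in succ.get(y, ())}
        # Keep only facts that were not known before; stop when none appear.
        delta = derived - reach
        reach |= delta
    return reach


# Hypothetical sample data: a chain a -> b -> c -> d.
closure = transitive_closure([("a", "b"), ("b", "c"), ("c", "d")])
print(sorted(closure))
```

A naive evaluation would re-join the full `reach` set in every round; restricting the join to `delta` avoids re-deriving facts that are already known.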

“…Given a set of query templates, the query generator instantiates these templates with actual RDF terms from the dataset. We instantiated 20 of these templates each with 100 queries, so in total we got the 2000 unique queries, more details of the templates can be found on the Watdiv website 12 . Figure 4 shows the overall performance per query template type, while Figure 3 goes into more detail by showing the performance on the 20 query templates.…”

Section: Query Template Type Performance (mentioning)

confidence: 99%

“…• Translating SPARQL and RDF to existing Big Data approaches such as MapReduce [11], Impala [12], Apache Spark [4];…”

Section: Introduction (mentioning)

confidence: 99%

Big linked data ETL benchmark on cloud commodity hardware

Witte, Vocht, Verborgh, et al. 2016

Proceedings of the International Workshop on Semantic Big Data

Linked Data storage solutions often optimize for low latency querying and quick responsiveness. Meanwhile, in the back-end, offline ETL processes take care of integrating and preparing the data. In this paper we explain a workflow and the results of a benchmark that examines which Linked Data storage solution and setup should be chosen for different dataset sizes to optimize the cost-effectiveness of the entire ETL process. The benchmark executes diversified stress tests on the storage solutions. The results include an in-depth analysis of four mature Linked Data solutions with commercial support and full SPARQL 1.1 compliance. Whereas traditional benchmark studies generally deploy the triple stores on premises using high-end hardware, this benchmark uses publicly available cloud machine images for reproducibility and runs on commodity hardware. All stores are tested using their default configuration. In this setting Virtuoso shows the best performance in general. The other three stores show competitive results and have disjoint areas of excellence. Finally, it is shown that each store's performance heavily depends on the structural properties of the queries, giving an indication of where vendors can focus their optimization efforts.

“…S2RDF does not run on Spark directly; it translates SPARQL queries into SQL jobs which are then executed on top of Spark SQL [19]. S2RDF follows a similar approach to Sempala [50] and PigSPARQL [14]. Sempala is a distributed RDF engine that translates SPARQL into SQL which runs on top of Apache Impala [35].…”

Section: Sophisticated Partitioning (mentioning)

confidence: 99%
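
The SPARQL-to-SQL translation described in the statement above can be sketched on a toy scale. The example assumes a vertically partitioned layout with one two-column table `(s, o)` per predicate, which is an illustrative assumption and not necessarily the layout Sempala or S2RDF uses. It also uses Python's built-in `sqlite3` as a stand-in for Impala or Spark SQL. The helper `bgp_to_sql`, the predicate names and the sample data are hypothetical, and the sketch expects constant predicates and at least one variable in the pattern.

```python
import sqlite3


def bgp_to_sql(patterns):
    """Translate triple patterns (s, p, o) into one SQL join query.

    Terms starting with '?' are variables; each predicate names a table (s, o).
    """
    first_ref, where, tables, params = {}, [], [], []
    for i, (s, p, o) in enumerate(patterns):
        alias = f"t{i}"
        tables.append(f'"{p}" AS {alias}')
        for term, col in ((s, "s"), (o, "o")):
            ref = f"{alias}.{col}"
            if not term.startswith("?"):
                where.append(f"{ref} = ?")  # constant term becomes a filter
                params.append(term)
            elif term in first_ref:
                where.append(f"{first_ref[term]} = {ref}")  # shared variable becomes a join
            else:
                first_ref[term] = ref  # first occurrence is projected
    cols = ", ".join(f'{ref} AS "{var[1:]}"' for var, ref in first_ref.items())
    sql = f"SELECT {cols} FROM {', '.join(tables)}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


# Hypothetical data: one table per predicate.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "knows" (s TEXT, o TEXT)')
conn.execute('CREATE TABLE "livesIn" (s TEXT, o TEXT)')
conn.executemany('INSERT INTO "knows" VALUES (?, ?)',
                 [("alice", "bob"), ("bob", "carol")])
conn.executemany('INSERT INTO "livesIn" VALUES (?, ?)',
                 [("bob", "Freiburg"), ("carol", "Ghent")])

# SPARQL equivalent: SELECT ?x ?y WHERE { ?x knows ?y . ?y livesIn "Freiburg" }
sql, params = bgp_to_sql([("?x", "knows", "?y"), ("?y", "livesIn", "Freiburg")])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Each triple pattern contributes one table to the `FROM` clause, so a basic graph pattern with n patterns becomes an n-way join whose conditions come from the shared variables.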

“…Each SPARQL query is decomposed into multiple subqueries, which are then evaluated independently. Since the data is

[46] | Subject Hash | Distributed Semi-Join
CliqueSquare [25] | Hybrid (Hash + VP) | MapReduce-based Join
DREAM [38] | No partitioning; full replication | RDF-3X [53]
EAGRE [56] | METIS | MapReduce-based Join
gStoreD [45] | Partitioning Agnostic | gStore [37]
H-RDF-3X [29] | METIS | RDF-3X [53]
H2RDF+ [41] | H-Base partitioner (range) | Centralized + MapReduce
HadoopRDF [30] | VP + predicate files on HDFS | MapReduce Join
Partout [36] | Workload-based fragmentation | RDF-3X [53]
PigSparql [14] | Hash + Triple-based files | SPARQL to PigLatin
S2RDF [15] | Extended Vertical Partitioning | SPARQL to SQL
S2X [51] | GraphX partitioning strategy | Vertex-Centric BGP matching
Sedge [57] | Subject Hash | Vertex-Centric BGP matching
Sempala [50] | VP | SPARQL to SQL
SHAPE [32] | Semantic Hash Partitioning | RDF-3X [53]
SHARD [47] | Hash | MapReduce-based Join
TriAD [48] | Hash-based Sharding | Distributed Merge/Hash Joins
TriAD-SG [48] | METIS + Horizontal Sharding | Distributed Merge/Hash Joins
Trinity.RDF [33] | Key-value store on graph | Graph Exploration
WARP [28] | METIS on query workload | RDF-3X [53]

In this survey, we categorize distributed RDF management systems along 2 dimensions based on their execution model: (i) MapReduce and Graph-based systems: such systems rely on general purpose frameworks, i.e., Hadoop or Spark, that offer seamless data distribution and parallelization at the cost of flexibility. (ii) Specialized RDF systems: are built specifically for SPARQL query evaluation by utilizing custom physical layouts, native RDF indexing, efficient communication protocols and explicit replication.…”

Section: Distributed RDF Systems (mentioning)

confidence: 99%

A survey and experimental comparison of distributed SPARQL engines for very large RDF data

et al. 2017

Distributed SPARQL engines promise to support very large RDF datasets by utilizing shared-nothing computer clusters. Some are based on distributed frameworks such as MapReduce; others implement proprietary distributed processing; and some rely on expensive preprocessing for data partitioning. These systems exhibit a variety of trade-offs that are not well-understood, due to the lack of any comprehensive quantitative and qualitative evaluation. In this paper, we present a survey of 22 state-of-the-art systems that cover the entire spectrum of distributed RDF data processing and categorize them by several characteristics. Then, we select 12 representative systems and perform extensive experimental evaluation with respect to preprocessing cost, query performance, scalability and workload adaptability, using a variety of synthetic and real large datasets with up to 4.3 billion triples. Our results provide valuable insights for practitioners to understand the trade-offs for their usage scenarios. Finally, we publish online our evaluation framework, including all datasets and workloads, for researchers to compare their novel systems against the existing ones.
