S2X: Graph-Parallel Querying of RDF with GraphX

Schätzle, Alexander; Przyjaciel-Zablocki, Martin; Berberich, Thorsten; Lausen, Georg

doi:10.1007/978-3-319-41576-5_12

Cited by 52 publications

(38 citation statements)

References 5 publications


“…The results of this experiment include a comparative evaluation of our method against four state-of-the-art public disk-based distributed RDF systems proposed in the most recent three years, including DREAM [7], S2X [19], S2RDF [20], and CliqueSquare [4], which are provided by [1]. Other distributed RDF systems in the most recent three years are either unreleased, or are memory-based systems that are in different environments than targeted in this study.…”

Section: F Online Performance Comparison (mentioning)

confidence: 99%


Accelerating Partial Evaluation in Distributed SPARQL Query Evaluation

Peng

Zou

Guan

2019

2019 IEEE 35th International Conference on Data Engineering (ICDE)


Partial evaluation has recently been used for processing SPARQL queries over a large resource description framework (RDF) graph in a distributed environment. However, the previous approach is inefficient when dealing with complex queries. In this study, we further improve the "partial evaluation and assembly" framework for answering SPARQL queries over a distributed RDF graph, while providing performance guarantees. Our key idea is to explore the intrinsic structural characteristics of partial matches to filter out irrelevant partial results, while providing performance guarantees on a network trace (data shipment) or the computational cost (response time). We also propose an efficient assembly algorithm to utilize the characteristics of partial matches to merge them and form final results. To improve the efficiency of finding partial matches further, we propose an optimization that communicates variables' candidates among sites to avoid redundant computations. In addition, although our approach is partitioning-tolerant, different partitioning strategies result in different performances, and we evaluate different partitioning strategies for our approach. Experiments over both real and synthetic RDF datasets confirm the superiority of our approach.


“…First, some recent works (e.g., [4], [20], [19]) focus on managing RDF datasets using cloud platforms. CliqueSquare [4] discusses how to build query plans by relying on n-ary (star) equality joins in Hadoop.…”

Section: Related Work (mentioning)

confidence: 99%

Accelerating Partial Evaluation in Distributed SPARQL Query Evaluation

Peng

Zou

Guan

2019

2019 IEEE 35th International Conference on Data Engineering (ICDE)


“…S2X [51] exploits the inherited graph structure of RDF to process SPARQL as graph-based computations on top of GraphX. It uses the parallel vertex-centric model to evaluate the BGP matching of SPARQL while other operators, such as OPTIONAL and FILTER, are processed through Spark RDD operators.…”

Section: Mapreduce and Graph Based Systems (mentioning)

confidence: 99%
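
The division of labor described in the statement above (basic graph pattern matching first, other operators such as FILTER applied afterward) can be illustrated with a small sketch. This is not S2X's implementation and does not use GraphX or Spark: it is a naive, single-machine nested-loop matcher over an in-memory triple list, and the names `match_pattern`, `match_bgp`, and the sample data are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Binding = Dict[str, str]        # variable name -> bound term


def is_var(term: str) -> bool:
    """Variables are written SPARQL-style with a leading '?'."""
    return term.startswith("?")


def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`; return None on conflict."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def match_bgp(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Evaluate a basic graph pattern by joining one triple pattern at a time."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        next_bindings: List[Binding] = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Hypothetical sample data (assumption, not taken from any cited paper).
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "age", "17"),
    ("carol", "age", "25"),
]

# BGP: ?x knows ?y . ?y age ?a
solutions = match_bgp([("?x", "knows", "?y"), ("?y", "age", "?a")], triples)

# FILTER (?a >= 18), applied as a separate step over the BGP solutions.
adults = [b for b in solutions if int(b["?a"]) >= 18]
```

Here `solutions` holds the BGP matches and `adults` the subset that survives the FILTER step. A vertex-centric engine would distribute the matching across graph vertices rather than loop over all triples, which this sketch does not attempt.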

“…Each SPARQL query is decomposed into multiple subqueries, which are then evaluated independently. Since the data is

System | Partitioning strategy | Query processing
[46] | Subject Hash | Distributed Semi-Join
CliqueSquare [25] | Hybrid (Hash + VP) | MapReduce-based Join
DREAM [38] | No partitioning; full replication | RDF-3X [53]
EAGRE [56] | METIS | MapReduce-based Join
gStoreD [45] | Partitioning Agnostic | gStore [37]
H-RDF-3X [29] | METIS | RDF-3X [53]
H2RDF+ [41] | H-Base partitioner (range) | Centralized + MapReduce
HadoopRDF [30] | VP + predicate files on HDFS | MapReduce Join
Partout [36] | Workload-based fragmentation | RDF-3X [53]
PigSparql [14] | Hash + Triple-based files | SPARQL to PigLatin
S2RDF [15] | Extended Vertical Partitioning | SPARQL to SQL
S2X [51] | GraphX partitioning strategy | Vertex-Centric BGP matching
Sedge [57] | Subject Hash | Vertex-Centric BGP matching
Sempala [50] | VP | SPARQL to SQL
SHAPE [32] | Semantic Hash Partitioning | RDF-3X [53]
SHARD [47] | Hash | MapReduce-based Join
TriAD [48] | Hash-based Sharding | Distributed Merge/Hash Joins
TriAD-SG [48] | METIS + Horizontal Sharding | Distributed Merge/Hash Joins
Trinity.RDF [33] | Key-value store on graph | Graph Exploration
WARP [28] | METIS on query workload | RDF-3X [53]

In this survey, we categorize distributed RDF management systems along 2 dimensions based on their execution model: (i) MapReduce and Graph-based systems: such systems rely on general purpose frameworks, i.e., Hadoop or Spark, that offer seamless data distribution and parallelization at the cost of flexibility. (ii) Specialized RDF systems: are built specifically for SPARQL query evaluation by utilizing custom physical layouts, native RDF indexing, efficient communication protocols and explicit replication.…”

Section: Distributed RDF Systems (mentioning)

confidence: 99%

“…Then, we perform extensive experimental evaluation of the following 12 representative systems: S2RDF [15], AdPart [46], DREAM [38], Urika-GD [11], CliqueSquare [25], S2X [51], TriAD [48], SHAPE [32], H-RDF-3X [29], H2RDF+ [41], SHARD [47] and gStoreD [45]. We use all the standard synthetic benchmarks (e.g., LUBM [6]) and a variety of very large real datasets (e.g., Bio2RDF [3]) with up to 4.3 billion triples, to stretch the systems to their limits.…”

Section: Introduction (mentioning)

confidence: 99%


A survey and experimental comparison of distributed SPARQL engines for very large RDF data

et al. 2017


Distributed SPARQL engines promise to support very large RDF datasets by utilizing shared-nothing computer clusters. Some are based on distributed frameworks such as MapReduce; others implement proprietary distributed processing; and some rely on expensive preprocessing for data partitioning. These systems exhibit a variety of trade-offs that are not well-understood, due to the lack of any comprehensive quantitative and qualitative evaluation. In this paper, we present a survey of 22 state-of-the-art systems that cover the entire spectrum of distributed RDF data processing and categorize them by several characteristics. Then, we select 12 representative systems and perform extensive experimental evaluation with respect to preprocessing cost, query performance, scalability and workload adaptability, using a variety of synthetic and real large datasets with up to 4.3 billion triples. Our results provide valuable insights for practitioners to understand the trade-offs for their usage scenarios. Finally, we publish online our evaluation framework, including all datasets and workloads, for researchers to compare their novel systems against the existing ones.


Storing and Querying Semantic Data in the Cloud

Janke

Staab

2018

Lecture Notes in Computer Science


In the last years, huge RDF graphs with trillions of triples were created. To be able to process this huge amount of data, scalable RDF stores are used, in which graph data is distributed over compute and storage nodes for scaling efforts of query processing and memory needs. The main challenges to be investigated for the development of such RDF stores in the cloud are: (i) strategies for data placement over compute and storage nodes, (ii) strategies for distributed query processing, and (iii) strategies for handling failure of compute and storage nodes. In this manuscript, we give an overview of how these challenges are addressed by scalable RDF stores in the cloud.
