Clause-iteration with MapReduce to scalably query datagraphs in the SHARD graph-store

Rohloff, Kurt; Schantz, Richard

doi:10.1145/1996014.1996021

Cited by 59 publications

(61 citation statements)

References 13 publications

“…The first category generally partitions an RDF dataset across multiple servers using horizontal (random) partitioning, stores partitions using distributed file systems such as Hadoop Distributed File System (HDFS), and processes queries by parallel access to the clustered servers using distributed programming model such as Hadoop MapReduce [20,12]. SHARD [20] directly stores RDF triples in HDFS as flat text files and runs one Hadoop job for each clause (triple pattern) of a SPARQL query.…”

Section: Related Work (mentioning)

confidence: 99%

“…SHARD [20] directly stores RDF triples in HDFS as flat text files and runs one Hadoop job for each clause (triple pattern) of a SPARQL query. [12] stores RDF triples in HDFS by hashing on predicates and runs one Hadoop job for each join of a SPARQL query.…”

Section: Related Work (mentioning)

confidence: 99%
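The clause-by-clause strategy described in the statement above can be illustrated with a small in-memory sketch. This is not SHARD's code and uses no Hadoop API: each loop iteration in the hypothetical `run_query` stands in for one job that scans every triple, and the names `match_clause`, `run_query`, and the sample data are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def match_clause(clause: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Try to extend `binding` so that `clause` matches `triple`; None on mismatch."""
    extended = dict(binding)
    for term, value in zip(clause, triple):
        if term.startswith("?"):            # variable term
            if extended.setdefault(term, value) != value:
                return None                 # conflicts with an earlier binding
        elif term != value:                 # constant term must match exactly
            return None
    return extended


def run_query(triples: List[Triple], clauses: List[Triple]) -> List[Binding]:
    """One full pass over the data per clause (a stand-in for one job per clause)."""
    bindings: List[Binding] = [{}]          # start from a single empty binding
    for clause in clauses:                  # each iteration models one job
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_clause(clause, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Hypothetical sample data and a two-clause query.
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "livesIn", "paris"),
]
clauses = [("?x", "knows", "?y"), ("?y", "livesIn", "?city")]
results = run_query(triples, clauses)
```

The point of the sketch is the cost model: a query with n clauses triggers n full scans of the stored triples, which is why later systems in this list focus on partitioning to avoid repeated distributed passes.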

“…The map function examines each baseline partition and reads a (anchor vertex, border vertex) pair, and emits a key-value pair in which the key is the border vertex and the value is the anchor vertex (lines 17-25). During the shuffling phase, a set of anchor vertices which have the same border vertex are grouped together.…”

Section: Algorithm and Implementation (mentioning)

confidence: 99%
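The map and shuffle steps quoted above can be sketched in plain Python. This is a minimal in-memory simulation, not the citing paper's implementation or a Hadoop job; the function names `map_pairs` and `shuffle` and the sample vertex labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple


def map_pairs(partition: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Map step: read (anchor, border) pairs and emit (border, anchor)."""
    for anchor, border in partition:
        yield border, anchor                # key = border vertex, value = anchor vertex


def shuffle(mapped: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Shuffle step: group all anchor vertices that share a border vertex."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for border, anchor in mapped:
        groups[border].add(anchor)
    return dict(groups)


# Hypothetical baseline partition of (anchor vertex, border vertex) pairs.
partition = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
grouped = shuffle(map_pairs(partition))
```

Swapping the pair so that the border vertex becomes the key is what lets the framework's grouping do the work: every anchor that reaches the same border vertex arrives at the same reducer.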

Scaling queries over big RDF graphs with semantic hash partitioning

2013

Massive volumes of big RDF data are growing beyond the performance capacity of conventional RDF data management systems operating on a single node. Applications using large RDF data demand efficient data partitioning solutions for supporting RDF data access on a cluster of compute nodes. In this paper we present a novel semantic hash partitioning approach and implement a Semantic HAsh Partitioning-Enabled distributed RDF data management system, called Shape. This paper makes three original contributions. First, the semantic hash partitioning approach we propose extends the simple hash partitioning method through direction-based triple groups and direction-based triple replications. The latter enhances the former by controlled data replication through intelligent utilization of data access locality, such that queries over big RDF graphs can be processed with zero or very small amount of inter-machine communication cost. Second, we generate locality-optimized query execution plans that are more efficient than popular multi-node RDF data management systems by effectively minimizing the inter-machine communication cost for query processing. Third but not the least, we provide a suite of locality-aware optimization techniques to further reduce the partition size and cut down on the inter-machine communication cost during distributed query processing. Experimental results show that our system scales well and can process big RDF datasets more efficiently than existing approaches.

“…Hadoop based RDF data systems, such as [16], [23], [24] directly store RDF data as HDFS files, and distribute these files by using the file partitioning and placement policies in the vanilla Hadoop. However, previous studies [9], [17] showed that, without carefully designed data partitioning algorithms and data localization strategies, massive I/O cost and communication overhead would be incurred in these kind of systems.…”

Section: Related Work (mentioning)

confidence: 99%

“…A popular approach to partition RDF data is hash partitioning, which is adopted by a majority of the existing distributed RDF engines [13], [14], [18], [24]. This approach distributes RDF triples across different partitions by computing a hash key over either the subject or the object of each triple.…”

Section: Introduction: RDF (Resource Description Framework) (mentioning)

confidence: 99%
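Hash partitioning as described in the statement above can be sketched as follows. This is an illustrative example under stated assumptions, not the scheme of any particular engine: the helper names `partition_id` and `hash_partition`, the use of MD5 as a stable hash, and the sample triples are all choices made here for the sketch.

```python
import hashlib
from typing import List, Tuple

Triple = Tuple[str, str, str]


def partition_id(term: str, num_partitions: int) -> int:
    """Stable hash of an RDF term mapped to a partition number."""
    # MD5 is used only as a deterministic hash (Python's built-in hash() is salted per run).
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def hash_partition(triples: List[Triple], num_partitions: int,
                   key: str = "subject") -> List[List[Triple]]:
    """Assign each triple to a partition by hashing its subject or its object."""
    index = 0 if key == "subject" else 2    # position of the hashed term in the triple
    partitions: List[List[Triple]] = [[] for _ in range(num_partitions)]
    for triple in triples:
        partitions[partition_id(triple[index], num_partitions)].append(triple)
    return partitions


# Hypothetical sample data.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "livesIn", "paris"),
    ("bob", "knows", "carol"),
]
by_subject = hash_partition(triples, num_partitions=4, key="subject")
```

With subject hashing, all triples that share a subject land in the same partition, so star-shaped queries around one subject need no cross-partition join; queries that chain from an object to another triple's subject may still cross partitions.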

Scalable SPARQL querying using path partitioning

Zhou, Yuan, et al. 2015

2015 IEEE 31st International Conference on Data Engineering

Abstract: The emerging need for conducting complex analysis over big RDF datasets calls for scale-out solutions that can harness a computing cluster to process big RDF datasets. Queries over RDF data often involve complex self-joins, which would be very expensive to run if the data are not carefully partitioned across the cluster and hence distributed joins over massive amounts of data are necessary. Existing RDF data partitioning methods can nicely localize simple queries but still need to resort to expensive distributed joins for more complex queries. In this paper, we propose a new data partitioning approach that makes use of the rich structural information in RDF datasets and minimizes the amount of data that have to be joined across different computing nodes. We conduct an extensive experimental study using two popular RDF benchmark datasets and one real RDF dataset that contain up to billions of RDF triples. The results indicate that our approach can produce a balanced and low-redundancy data partitioning scheme that can avoid or largely reduce the cost of distributed joins even for very complicated queries. In terms of query execution time, our approach can outperform the state-of-the-art methods by orders of magnitude.

BRGP: a balanced RDF graph partitioning algorithm for cloud storage

Leng, Chen, Zhong, et al. 2016

Concurrency and Computation

Summary: The continuous growth of resource description framework (RDF) data poses an important challenge for RDF data partitioning, which is a vital technique for effective cloud storage. Recently, many partitioning algorithms for large RDF data have been developed, and most of them are based on graph partitioning. However, existing graph partitioning methods could not partition asymmetric RDF data effectively, resulting in lower performance for cloud storage. This paper proposes a balanced RDF graph partitioning algorithm for storing massive RDF data on the cloud. We first devise a modularity‐based multi‐level label propagation algorithm (MMLP) to partition the RDF graph roughly and then use a balanced K‐medoids clustering algorithm for final k‐way partitioning. The balanced RDF graph partitioning algorithm designs an effective label update rule and a balanced modification strategy to achieve a high-quality coarsening result and make the partition as balanced as possible. Experiments are carried out on two representative RDF benchmarks and one real RDF dataset by comparison with two representative graph partitioning methods, that is, METIS and MLP+METIS. Results demonstrate that our proposed scheme can produce a high‐quality partition for massive RDF data storage on the cloud. Copyright © 2016 John Wiley & Sons, Ltd.
