Robust and Efficient Large-Large Table Outer Joins on Distributed Infrastructures

Proceedings of the 25th ACM Conference on Hypertext and Social Media

Kotoulas

Ward

et al. 2014

Self Cite

The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or charge, for personal research or study, educational, or not-for-profit purposes provided that:

• a full bibliographic reference is made to the original source
• a link is made to the metadata record in DRO
• the full-text is not changed in any way

The full-text must not be sold in any format or medium without the formal permission of the copyright holders. Please consult the full DRO policy for further details.

ABSTRACT

We propose an efficient method for fast processing of large RDF data over distributed memory. Our approach adopts a two-tier index architecture on each computation node: (1) a light-weight primary index, to keep loading times low, and (2) a dynamic, multi-level secondary index, calculated as a by-product of query execution, to decrease or remove inter-machine data movement for subsequent queries that contain the same graph patterns. Experimental results on a commodity cluster show that we can load large RDF data very quickly in memory while remaining within an interactive range for query processing with the secondary index.

Section: Results (mentioning)

confidence: 99%

A two-tier index architecture for fast processing large RDF data over distributed memory

Proceedings of the 25th ACM Conference on Hypertext and Social Media

Kotoulas

Ward

et al. 2014

Self Cite

“…(2) Cardinality: To see how the performance changes with increasing dataset size, similar as the evaluation works on joins [13] [14], we just fix the cardinality of relation R, to 25 million, and varying the |S| from 25 million to 400 million, a number which is extreme big for the available annotation pairs. As the skew handling is beyond the scope of this work, we just keep the data uniform distributed based on their first join key a as stated previously.…”

Section: A Benchmark Scenarios (mentioning)

confidence: 99%
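
The benchmark setup quoted above (|R| fixed at 25 million, |S| varied from 25 million to 400 million, join keys uniformly distributed) can be sketched roughly as follows. This is an illustrative sketch only, not the cited authors' code: the doubling step between |S| values, the function names, and the modulo placement of integer keys on nodes are all assumptions made for the example.

```python
def cardinality_sweep(r_size=25_000_000, s_min=25_000_000, s_max=400_000_000):
    """Yield (|R|, |S|) pairs with |R| fixed and |S| growing.

    Assumption: |S| doubles at each step; the quoted text only gives
    the two endpoints of the range.
    """
    s_size = s_min
    while s_size <= s_max:
        yield r_size, s_size
        s_size *= 2


def partition_counts(keys, num_nodes):
    """Count how many integer join keys land on each node.

    Assumption: a key is placed on node `key % num_nodes`.
    """
    counts = [0] * num_nodes
    for key in keys:
        counts[key % num_nodes] += 1
    return counts


if __name__ == "__main__":
    for r_size, s_size in cardinality_sweep():
        print(f"|R| = {r_size:,}  |S| = {s_size:,}")
    # Uniformly spread keys give every node the same share.
    print(partition_counts(range(1000), 4))
```

Under the doubling assumption the sweep visits five |S| values (25, 50, 100, 200 and 400 million), and uniformly spread keys place an equal number of tuples on every node, which is why skew handling can be left out of such a setup.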

Investigating Distributed Approaches to Efficiently Extract Textual Evidences for Biomedical Ontologies

2014 IEEE International Conference on Bioinformatics and Bioengineering

2014

Self Cite

Heterogeneous data resources in biomedicine are becoming available in both structured and unstructured formats, such as scientific publications, healthcare guidelines, controlled vocabularies, and formal ontologies. Bridging the gaps among these heterogeneous data is useful for discovering implicit knowledge. To make this happen, efficient computational approaches are a necessity for applications in such a knowledge- and data-intensive domain. In this paper, we first define a particular task, relation alignment, which is to identify textual evidence for biomedical ontologies. Then, we investigate two parallel approaches for this task over distributed systems and present the details of their implementations. Moreover, we characterize the performance of our methods through extensive experiments, thereby allowing researchers to make a more informed choice in the presence of large-scale biomedical data.

“…Compared to these, in our previous work [7,8,9], we have employed the semijoin-alike pattern with full parallelism as a new distributed geography (namely not just a simple join operation) for handling data skew and apply it for parallel inner joins and outer joins directly. In this work, we focus on the inner joins (namely joins).…”

Section: Related Work (mentioning)

confidence: 99%

“…We conclude our analysis with the presentation of speedup using the very popular Hash algorithm as a baseline 9 , by analyzing the performance improvement achieved for joins in each algorithm for different numbers of nodes. Figure 9 presents the speedup ratio of PRPD, PRPQ and the Query algorithm over the basic hash method with increasing number of nodes from 2 (24 cores) to 16 and for skew values 1 and 1.4 respectively.…”

Section: Comparison With Hash-based Joins (mentioning)

confidence: 99%
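
The speedup ratio mentioned in the quote above is, under the usual definition, the baseline runtime divided by the runtime of the algorithm being compared. A minimal sketch of that calculation follows; the function name and the timings in the usage lines are hypothetical placeholders, not measurements from any of the cited papers.

```python
def speedup_over_baseline(baseline_seconds, runtimes):
    """Return the speedup of each algorithm relative to a baseline.

    speedup = baseline runtime / algorithm runtime, so values above 1.0
    mean the algorithm finished faster than the baseline.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    speedups = {}
    for name, seconds in runtimes.items():
        if seconds <= 0:
            raise ValueError(f"runtime for {name!r} must be positive")
        speedups[name] = baseline_seconds / seconds
    return speedups


if __name__ == "__main__":
    # Placeholder timings, invented purely to exercise the function.
    placeholder_runtimes = {"algo_a": 5.0, "algo_b": 20.0}
    print(speedup_over_baseline(10.0, placeholder_runtimes))
```

Repeating the calculation for each node count and skew value would give one curve per algorithm, which is the shape of comparison the quoted passage describes.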

Robust and Skew-resistant Parallel Joins in Shared-Nothing Systems

Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management

Kotoulas

Ward

et al. 2014

Self Cite

ABSTRACT

The performance of joins in parallel database management systems is critical for data-intensive operations such as querying. Since data skew is common in many applications, poorly engineered join operations result in load imbalance and performance bottlenecks. State-of-the-art methods designed to handle this problem offer significant improvements over naive implementations. However, performance could be further improved by removing the dependency on global skew knowledge and broadcasting. In this paper, we propose PRPQ (partial redistribution & partial query), an efficient and robust join algorithm for processing large-scale joins over distributed systems. We present the detailed implementation and a quantitative evaluation of our method. The experimental results demonstrate that the proposed PRPQ algorithm is indeed robust and scalable under a wide range of skew conditions. Specifically, compared to the state-of-the-art PRPD method, we achieve 16% to 167% performance improvement and 24% to 54% less network communication under different join workloads.
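
The load imbalance that the abstract attributes to data skew can be illustrated with a small sketch. It models only naive key-based redistribution, not PRPQ or PRPD; the helper names, the modulo placement of integer keys, and the imbalance measure (maximum load divided by mean load) are assumptions chosen for the example.

```python
def node_loads(keys, num_nodes):
    """Count tuples per node when integer key k is sent to node k % num_nodes."""
    loads = [0] * num_nodes
    for key in keys:
        loads[key % num_nodes] += 1
    return loads


def imbalance(loads):
    """Ratio of the busiest node's load to the mean load (1.0 = balanced)."""
    mean_load = sum(loads) / len(loads)
    return max(loads) / mean_load


if __name__ == "__main__":
    uniform_keys = list(range(8))        # every key appears once
    skewed_keys = [0] * 5 + [1, 2, 3]    # key 0 dominates the input
    print(node_loads(uniform_keys, 4), imbalance(node_loads(uniform_keys, 4)))
    print(node_loads(skewed_keys, 4), imbalance(node_loads(skewed_keys, 4)))
```

With uniform keys each of the four nodes receives two tuples, while the skewed input sends five of the eight tuples to a single node, so that node becomes the bottleneck for the whole join.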