Load balancing and skew resilience for parallel joins

Vitorovic, Aleksandar; Elseidy, Mohammed

doi:10.1109/icde.2016.7498250

Cited by 16 publications

(9 citation statements)

References 28 publications

(28 reference statements)


“…In data stream processing, DYNAMIC [4] supports adaptive repartitioning according to the change of data streams. To ensure the load balancing and skew resilience, Aleksandar el.al [6] proposed a multi-stage load-balancing algorithm by using a novel category of equi-weight histograms. However, [4,6] assumes that the number of partitions must be 2^n.…”

Section: Related Work (mentioning)

confidence: 99%


D-JB: An Online Join Method for Skewed and Varied Data Streams

Wang, Feng, Shi (2018). Intelligence Science II


Scalable distributed join processing in a parallel environment requires a partitioning policy to transfer data. Online theta-joins over data streams are more computationally expensive and impose higher memory requirement in distributed data stream management systems (DDSMS) than database management systems (DBMS). The complete bipartite graph-based model can support distributed stream joins, and has the characteristics of memory-efficiency, elasticity and scalability. However, due to the instability of data stream rate and the imbalance of attribute value distribution, the online theta-joins over skewed and varied streams lead to the load imbalance of cluster. In this paper, we present a framework D-JB (Dynamic Join Biclique) for handling skewed and varied streams, enhancing the adaptability of the join model and minimizing the system cost based on the varying workloads. Our proposal includes a mixed key-based and tuple-based partitioning scheme to handle skewed data in each side of the bipartite graph-based model, a strategy for redistribution of query nodes in two sides of this model, and a migration algorithm about state consistency to support full-history joins. Experiments show that our method can effectively handle skewed and varied data streams and improve the throughput of DDSMS.


“…To ensure the load balancing and skew resilience, Aleksandar el.al [6] proposed a multi-stage load-balancing algorithm by using a novel category of equi-weight histograms. However, [4,6] assumes that the number of partitions must be 2^n. So, the matrix structure suffers from bad flexibility.…”

Section: Related Work (mentioning)

confidence: 99%

D-JB: An Online Join Method for Skewed and Varied Data Streams

Wang, Feng, Shi (2018). Intelligence Science II


“…Koumarelas et al deal with the issue of preprocessing the JM, when selectivity information is known a‐priori. Victorovic et al improves on the previous study for a specific form of JMs, termed as monotonic. Beame et al, Zhang et al, and Cu et al investigate the case where multiple relations are joined in a single step.…”

Section: Related Work (mentioning)

confidence: 89%

GPU processing of theta‐joins

Bellas, Gounaris (2017). Concurrency and Computation


Summary: The GPGPU paradigm has recently been employed to accelerate the processing of big amounts of data through the utilization of the massive parallelism offered by modern GPUs. To date, several techniques have been proposed for the implementation of simple select, aggregate, and equality join operations on GPUs. In this paper, we study the efficient implementation of theta‐join queries between two relations using the CUDA framework. Theta‐joins are notoriously slow and thus can benefit from massively parallel execution. However, their GPU‐based implementation significantly differs from hash‐ and sort‐based equality joins and needs to be carefully crafted. The implementation is driven by two main objectives. The first relates to the attainment of high efficiency in the parallelization through data reuse, which relates to the minimization of accesses to the slow global memory. The second is about the most efficient exploitation of the available memory given that, in general, it cannot hold the entire input and result. We propose a methodology for processing theta‐joins on a GPU, which exploits the heterogeneous nature of GPGPU, while addressing memory limitations. Furthermore, we provide a series of implementation optimizations, which yield performance improvements of an order of magnitude.


“…Input partitioning was also considered for the more general problem of distributed theta-join computation. Vitorovic et al [38] propose a new tiling algorithm to partition the join matrix in a balanced way, improving over earlier work by Okcan and Riedewald [27]. However, the authors themselves point out that for equi-joins one should instead rely on a specialized solution such as [2], because general theta-join approaches do not exploit the strong structural properties of key-equality based matching in equi-joins.…”

Section: Related Work (mentioning)

confidence: 99%

Submodularity of Distributed Join Computation

Riedewald, Deng (2018). Proceedings of the 2018 International Conference on Management of Data


We study distributed equi-join computation in the presence of join-attribute skew, which causes load imbalance. Skew can be addressed by more fine-grained partitioning, at the cost of input duplication. For random load assignment, e.g., using a hash function, fine-grained partitioning creates a tradeoff between load expectation and variance. We show that minimizing load variance subject to a constraint on expectation is a monotone submodular maximization problem with Knapsack constraints, hence admitting provably near-optimal greedy solutions. In contrast to previous work on formal optimality guarantees, we can prove this result also for self-joins and more general load functions defined as weighted sum of input and output. We further demonstrate through experiments that this theoretical result leads to an effective algorithm for the problem of minimizing running time, even when load is assigned deterministically.

