Scalable and adaptive online joins

Elseidy, Mohammed; Elguindy, Abdallah; Vitorovic, Aleksandar; Koch, Christoph

doi:10.14778/2732279.2732281

Cited by 70 publications

(45 citation statements)

References 40 publications

“…In contrast, stream processing requires to be real-time, a challenge that has drawn increasing attention from researchers [2,14,21,11]. Nevertheless, stream data cleaning approaches are still in their infancy.…”

Section: Related Work (mentioning)

confidence: 99%

Bleach: A Distributed Stream Data Cleaning System

Tian

Michiardi

Vukolić

2017

2017 IEEE International Congress on Big Data (BigData Congress)

In this paper we address the problem of rule-based stream data cleaning, which sets stringent requirements on latency, rule dynamics and ability to cope with the unbounded nature of data streams. We design a system, called Bleach, which achieves real-time violation detection and data repair on a dirty data stream. Bleach relies on efficient, compact and distributed data structures to maintain the necessary state to repair data, using an incremental version of the equivalence class algorithm. Additionally, it supports rule dynamics and uses a "cumulative" sliding window operation to improve cleaning accuracy. We evaluate a prototype of Bleach using a TPC-DS derived dirty data stream and observe its high throughput, low latency and high cleaning accuracy, even with rule dynamics. Experimental results indicate superior performance of Bleach compared to a baseline system built on the microbatch streaming paradigm.

“…We refer to a set of cells (that is, the corresponding input tuples) assigned to a single machine for local processing as a region. We adhere to rectangular regions, as opposed to rectilinear or non-contiguous regions, to incur minimal storage and communication costs [9].…”

Section: Background and Preliminaries (mentioning)

confidence: 99%

“…The content-insensitive partitioning scheme, CI (called 1-Bucket in [4], [9]), illustrated in Figure 1b, assigns all cells (n^2 of them) to machines, regardless of the join condition. Thus, regions cover the entire join matrix.…”

Section: A Content-insensitive Partitioning Scheme (mentioning)

confidence: 99%

Load balancing and skew resilience for parallel joins

Vitorovic

Elseidy

2016

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

Self Cite

We address the problem of load balancing for parallel joins. We show that the distribution of input data received and the output data produced by worker machines are both important for performance. As a result, previous work, which optimizes either for input or output, stands ineffective for load balancing. To that end, we propose a multi-stage load-balancing algorithm which considers the properties of both input and output data through sampling of the original join matrix. To do this efficiently, we propose a novel category of equi-weight histograms. To build them, we exploit state-of-the-art computational geometry algorithms for rectangle tiling. To our knowledge, we are the first to employ tiling algorithms for join load-balancing. In addition, we propose a novel, join-specialized tiling algorithm that has drastically lower time and space complexity than existing algorithms. Experiments show that our scheme outperforms state-of-the-art techniques by up to a factor of 15.
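
The content-insensitive (1-Bucket) scheme quoted in the citation statements above can be illustrated with a small sketch. This is not code from either paper: it assumes the machines are arranged as a grid of rectangular regions over the join matrix, and the function names (`make_grid`, `route_r`, `route_s`, `meeting_machines`) are hypothetical. Each tuple is routed at random, without looking at the join condition, yet every pair of R and S tuples still meets on exactly one machine.

```python
import random


def make_grid(rows, cols):
    """Lay out rows * cols machine ids as a grid of rectangular regions.

    Assumption: one machine per region, regions tile the whole join matrix.
    """
    return [[r * cols + c for c in range(cols)] for r in range(rows)]


def route_r(grid, rng):
    """Send an R tuple to every machine in one randomly chosen row band."""
    return list(rng.choice(grid))


def route_s(grid, rng):
    """Send an S tuple to every machine in one randomly chosen column band."""
    col = rng.randrange(len(grid[0]))
    return [row[col] for row in grid]


def meeting_machines(r_targets, s_targets):
    """Machines that receive both tuples and can evaluate the join predicate."""
    return set(r_targets) & set(s_targets)


if __name__ == "__main__":
    rng = random.Random(42)
    grid = make_grid(2, 3)  # 6 machines, 2 row bands x 3 column bands
    r_targets = route_r(grid, rng)
    s_targets = route_s(grid, rng)
    print("R tuple replicated to:", r_targets)
    print("S tuple replicated to:", s_targets)
    print("pair is joined on:", meeting_machines(r_targets, s_targets))
```

Because a row band and a column band of the grid intersect in a single region, each cell of the join matrix is covered once. The cost is replication: in this sketch an R tuple is copied to `cols` machines and an S tuple to `rows` machines.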

“…To overcome this problem, a plethora of Adaptive Query Processing (AQP) techniques have been recently proposed in the literature aiming to adapt the runtime query plan in response to changes in the execution environment or the characteristics of the streaming data [4][5][6][7][8]. The rationale followed by these AQP techniques can be condensed into a three-phase procedure, called adaptivity loop [9].…”

Section: Main Text (mentioning)

confidence: 99%

“…First, the Adjust feedback function is called after a change is detected, instead of Initialize (line 24). Second, if the query plan is reoptimized, the monitoring phase calls the Initialize function of the change detection algorithm, so as the change detection algorithm to entirely forget its runtime state (lines 5-8). Recall that after the Initialize state, a "feedback-full" algorithm must collect feedback prior to be operational.…”

Section: The Novel Monitoring Phase (mentioning)

confidence: 99%

Incorporating change detection in the monitoring phase of adaptive query processing

Tsamoura

Gounaris

Manolopoulos

2016

J Internet Serv Appl

Recent Big Data research typically emphasizes the need to address the challenges stemming from the volume, velocity, variety and veracity aspects. However, another cross-cutting property of Big Data is volatility. In database technology, volatility is addressed with the help of adaptive query processing (AQP), which has become the dominant paradigm for executing queries in dynamic and/or streaming environments. As the characteristics of the runtime environment may vary significantly over time, AQP techniques employ a three-phase adaptivity loop to process the input queries, comprising feedback collection, analysis and re-optimization. In the monitoring phase, the standard approach is to collect feedback in a fixed-size sliding window. However, several problems arise when the techniques adopt a fixed-size sliding window for maintaining runtime collected feedback. In this work, we tackle this limitation and we propose a novel monitoring phase, which assesses the collected feedback, rendering an AQP technique capable of taking more informed decisions during the subsequent phases. The proposed approach is non-intrusive to the state-of-the-art adaptivity loop and can adopt any state-of-the-art online change detection algorithm through its plug-and-play abstraction. Another contribution of this work is a novel algorithm for detecting changes in a filter's drop probability, called β-CUSUM. The potential of the novel monitoring phase and of β-CUSUM is experimentally evaluated using both real-world and synthetic datasets.
