DHTJoin: processing continuous join queries using DHT networks

Palma, Wenceslao; Akbarinia, Reza; Pacitti, Esther; Valduriez, Patrick

doi:10.1007/s10619-009-7054-7

Cited by 8 publications

(10 citation statements)

References 38 publications

“…This real-time processing of the update stream introduces the interesting challenges related to throughput for join algorithms. Some techniques have been introduced already to process join queries over continuous streaming data (Golab & Özsu, 2003) (Babu & Widom, 2001) (Hammad, Aref, & Elmagarmid, 2008) (Palma, Akbarinia, Pacitti, & Valduriez, 2009) (Kim & Park, 2005) (Nguyen, Brezany, Tjoa, & Weippl, 2005). In this section we will outline the well known work that has already been done in this area with a particular focus on those which are closely related to our problem domain.…”

Section: Related Work (mentioning)

confidence: 99%

HYBRIDJOIN for Near-Real-Time Data Warehousing

Naeem

Dobbie

Weber

2013

Developments in Data Extraction, Management, and Analysis

“…This operator is parallelized as follows (see Figure 3 lines 11-23). Given a subcluster of N nodes to execute the CP operator, each tuple is sent to M = √N nodes of the destination subcluster (lines 15-22). Therefore, each load balancer splits its output into M substreams (line 13), according to a hash of the tuples % M (line 14).…”

Section: Join Operator (mentioning)

confidence: 99%
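
The splitting step described in the statement above can be illustrated with a minimal Python sketch. It is not StreamCloud's code: the function names (`substream_count`, `substream_of`, `split_output`) are hypothetical, CRC32 stands in for the unspecified hash function, and N is assumed to be a perfect square so that M = √N is an integer. The sketch covers only how a load balancer might assign tuples to M substreams; which destination nodes receive each substream is not stated in the excerpt and is left out.

```python
import math
import zlib


def substream_count(n_nodes: int) -> int:
    """Return M = sqrt(N). Assumption: N is a perfect square."""
    m = math.isqrt(n_nodes)
    if m * m != n_nodes:
        raise ValueError("this sketch assumes N is a perfect square")
    return m


def substream_of(tuple_key: str, m: int) -> int:
    """Map a tuple key to a substream index in [0, M).

    CRC32 is an assumed stand-in; the excerpt does not name the hash.
    """
    return zlib.crc32(tuple_key.encode("utf-8")) % m


def split_output(keys, n_nodes: int):
    """Split a load balancer's output into M substreams by hash % M."""
    m = substream_count(n_nodes)
    substreams = [[] for _ in range(m)]
    for key in keys:
        substreams[substream_of(key, m)].append(key)
    return substreams
```

With N = 16 destination nodes this yields M = 4 substreams, and every tuple with the same key always lands in the same substream.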

“…There has been recent work on exploiting peer-to-peer networks, in particular, distributed hash tables (DHTs) for processing Continuous multi-way joins over data streams [18] [19]. Although these works exploit hash-based join algorithms, the objective (increasing the size of the sliding window with addition of peers) is different than ours (scaling out) and the assumptions regarding the network (clusters vs. WANs) are very different.…”

Section: Related Work (mentioning)

confidence: 99%

StreamCloud: A Large Scale Data Streaming System

Gulisano

Jiménez-Peris

Patiño-Martínez

et al. 2010

2010 IEEE 30th International Conference on Distributed Computing Systems

Self Cite

Abstract: Data streaming has become an important paradigm for the real-time processing of continuous data flows in domains such as finance, telecommunications, networking, ... Some applications in these domains need to process massive data flows that current technology is unable to manage, that is, streams that, even for a single query operator, require the capacity of potentially many machines. Research efforts on data streaming have mainly focused on scaling in the number of queries or query operators, but overlooked the scalability issue with respect to the stream volume. In this paper, we present StreamCloud, a large scale data streaming system for processing large data stream volumes. We focus on how to parallelize continuous queries to obtain a highly scalable data streaming infrastructure. StreamCloud goes beyond the state of the art by using a novel parallelization technique that splits queries into subqueries that are allocated to independent sets of nodes in a way that minimizes the distribution overhead. StreamCloud is implemented as a middleware and is highly independent of the underlying data streaming engine. We explore and evaluate different strategies to parallelize data streaming and tackle the main bottlenecks and overheads to achieve scalability. The paper presents the system design, implementation and a thorough evaluation of the scalability of the fully implemented system.

“…The problem of failures during query processing in distributed data management systems has received a lot of attention [21][22][23]. Palma et al [21] has identified the problem of peer failures while processing join operations over distributed data streams.…”

Section: Existing Work On Reliability and Fault-tolerance (mentioning)

confidence: 99%

“…Palma et al [21] has identified the problem of peer failures while processing join operations over distributed data streams. The approach addresses unnecessary communication and aborts the execution on peers executing subsequent operators of the query if a failure has been detected.…”

Section: Existing Work On Reliability and Fault-tolerance (mentioning)

confidence: 99%

Fault-tolerant query processing in structured P2P-systems

Bestehorn

Weth

Buchmann

et al. 2010

Distrib Parallel Databases

Recently, a number of query processors have been proposed for the evaluation of relational queries in structured P2P systems. However, as these approaches do not consider peer or link failures, they cannot be deployed without extensions for real-world applications. We show that typical failures in structured P2P systems can have an unpredictable impact on the correctness of the result. In particular, stateful operators that store intermediate results on peers, e.g., the distributed hash join, must protect such results against failures. Although many replication schemes for P2P systems exist, they cannot replicate operator states while the query is processed. In this paper we propose an in-query replication scheme which replicates the state of an operator among the neighbors of the processing peer. Our analytical evaluation shows that the network overhead of the in-query replication is in O(1) regarding network size, i.e., our scheme is scalable. We have carried out an extensive experimental evaluation using simulations as well as a PlanetLab deployment. It confirms the effectiveness and the efficiency of the in-query replication scheme and shows the effectiveness of the routing extension in networks of varying reliability.
