A new operator for efficient stream-relation join processing in data streaming engines

Derakhshan, Roozbeh; Sattar, Abdul; Stantić, Bela

doi:10.1145/2505515.2505728

Cited by 6 publications

(9 citation statements)

References 11 publications

“…The algorithm for joining stream data with a disk-based relation by Derakhshan et al [31] uses a cache to store frequent master data tuples and a waiting queue for stream tuples that are not joined through the cache. The algorithm processes this waiting queue in batches.…”

Section: B Index-based Semi-stream Join (mentioning)

confidence: 99%
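The cache-plus-waiting-queue design described in the statement above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' operator: the class name `CachedStreamRelationJoin`, the LRU eviction policy, the in-memory `relation` dictionary standing in for the disk-based relation, and the `cache_size` and `batch_size` parameters are all hypothetical.

```python
from collections import OrderedDict, deque


class CachedStreamRelationJoin:
    """Toy stream-relation join: cache hits join at once, misses wait for a batch."""

    def __init__(self, relation, cache_size=2, batch_size=3):
        self.relation = relation      # assumption: dict standing in for disk-based master data
        self.cache = OrderedDict()    # assumption: LRU stands in for "frequent tuples"
        self.cache_size = cache_size
        self.batch_size = batch_size
        self.waiting = deque()        # stream tuples not joined through the cache
        self.disk_batches = 0         # number of batched accesses to the relation

    def _remember(self, key, row):
        # Insert or refresh a master-data row, evicting the least recently used one.
        self.cache[key] = row
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)

    def process(self, key, payload):
        # Cache hit: emit the joined tuple immediately.
        if key in self.cache:
            self.cache.move_to_end(key)
            return [(key, payload, self.cache[key])]
        # Cache miss: park the tuple and flush once a full batch has accumulated.
        self.waiting.append((key, payload))
        if len(self.waiting) >= self.batch_size:
            return self.flush()
        return []

    def flush(self):
        # Join every waiting tuple with one batched access to the relation.
        if not self.waiting:
            return []
        batch = list(self.waiting)
        self.waiting.clear()
        self.disk_batches += 1
        keys = dict.fromkeys(k for k, _ in batch)  # distinct keys, arrival order
        fetched = {k: self.relation[k] for k in keys if k in self.relation}
        out = [(k, p, fetched[k]) for k, p in batch if k in fetched]
        for k, row in fetched.items():
            self._remember(k, row)
        return out
```

In this sketch, a stream tuple whose key is cached never touches the relation, while misses are answered together, so the cost of one relation access is shared across the whole batch.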

Big Data Velocity Management–From Stream to Warehouse via High Performance Memory Optimized Index Join

et al. 2020

Efficient resource optimization is critical to manage the velocity and volume of real-time streaming data in near-real-time data warehousing and business intelligence. This article presents a memory optimisation algorithm for rapidly joining streaming data with persistent master data in order to reduce data latency. Typically, during the transformation phase of ETL (Extraction, Transformation, and Loading), a stream of transactional data needs to be joined with master data stored on disk. To implement this process, a semi-stream join operator is commonly used. Most semi-stream join operators cache frequent parts of the master data to improve their performance; this process requires careful distribution of allocated memory among the components of the join operator. This article presents a cache inequality approach to optimise cache size and memory. To test this approach, we present a novel Memory Optimal Index-based Join (MOIJ) algorithm. MOIJ supports many-to-many types of joins and adapts to dynamic streaming data. We also present a cost model for MOIJ and compare the performance with existing algorithms empirically as well as analytically. We envisage that the enhanced ability to process near-real-time streaming data using minimal memory will reduce latency in processing big data and will contribute to the development of high-performance real-time business intelligence systems.

“…Derakhshan et al [3] propose a cache-based method for the join between streaming data and a relation stored in a database under the record-at-a-time model in a centralized environment. So far, little attention has been paid to the stream-relation join processing under the micro-batch model in a distributed environment.…”

Section: Related Work (mentioning)

confidence: 99%

“…In order to interpret, enrich, and analyze the streaming data, streams need to be joined with data stored in relational or NoSQL databases (e.g., reference tables containing information about users or items) [2], [3]. To get meaningful information about an RFID tag ID, a Stream Processing Engine (SPE) must query the database to get the information about the ID [3]. To resolve shortened URLs in Tweets, an SPE needs to look up the expanded URLs stored in a database [4].…”

Section: Introduction (mentioning)

confidence: 99%

Distributed Join Processing Between Streaming and Stored Big Data Under the Micro-Batch Model

2019

In order to interpret, enrich, and analyze streaming data, stream applications often access data stored in an external database. Although there have been many studies on stream processing, little attention has been paid so far to the join between streaming data and stored data. In this paper, we propose a comprehensive solution called DS-join for distributed processing of the join under the micro-batch model of recent distributed stream processing engines (SPEs), such as Spark Streaming. The micro-batch model performs stream processing as a series of very small batch jobs and is more fault-tolerant in a distributed environment compared with the record-at-a-time model. The DS-join reduces the number of database accesses by using micro-batching. Furthermore, the DS-join optimizes the join operation by minimizing the data shuffling, managing a cache in a distributed SPE, parallelizing the join processing, and balancing the load between the SPE and the external database system. The experimental results using real and synthetic datasets show that, compared with the state-of-the-art methods, the DS-join significantly improves throughput, especially for large databases.

Index terms: micro-batch model, distributed stream processing engine, database system, distributed join processing, cache management, Spark Streaming.
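To illustrate why micro-batching can reduce the number of database accesses, the sketch below compares a record-at-a-time join with a micro-batch join that issues one lookup per batch over the distinct keys. It is not the DS-join algorithm itself: `CountingStore`, `get_many`, and both join functions are hypothetical names, and the store is a plain dictionary that merely counts requests.

```python
class CountingStore:
    """Hypothetical stand-in for an external database that counts lookup requests."""

    def __init__(self, rows):
        self.rows = rows
        self.requests = 0

    def get_many(self, keys):
        # One call models one round trip to the database.
        self.requests += 1
        return {k: self.rows[k] for k in keys if k in self.rows}


def join_record_at_a_time(stream, store):
    # Issue one database request for every stream tuple.
    out = []
    for key, payload in stream:
        found = store.get_many([key])
        if key in found:
            out.append((key, payload, found[key]))
    return out


def join_micro_batch(stream, store, batch_size):
    # Issue one database request per micro-batch, asking only for distinct keys.
    out = []
    for start in range(0, len(stream), batch_size):
        batch = stream[start:start + batch_size]
        distinct = list(dict.fromkeys(k for k, _ in batch))
        found = store.get_many(distinct)
        out.extend((k, p, found[k]) for k, p in batch if k in found)
    return out
```

Both functions produce the same joined tuples; the difference lies only in how many requests reach the store, which is the effect the abstract attributes to micro-batching.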

“…Stream relation join: Prior work on optimization of stream-relation joins for non-distributed streaming systems includes MeshJoin [20], Semi-Streaming Index Join (SSIJ) [2], CacheJoin [18], and a technique proposed by Derakhshan et al in [8].…”

Section: Related Work (mentioning)

confidence: 99%

Runtime optimization of join location in parallel data management systems

Chandra, Sudarshan 2017

Proc. VLDB Endow.

Applications running on parallel systems often need to join a streaming relation or a stored relation with data indexed in a parallel data storage system. Some applications also compute UDFs on the joined tuples. The join can be done at the data storage nodes, corresponding to reduce side joins, or by fetching data from the storage system to compute nodes, corresponding to map side joins. Both may be suboptimal: reduce side joins may cause skew, while map side joins may lead to a lot of data being transferred and replicated. In this paper, we present techniques to make runtime decisions between the two options on a per key basis, in order to improve the throughput of the join, accounting for UDF computation if any. Our techniques are based on an extended ski-rental algorithm and provide worst-case performance guarantees with respect to the optimal point in the space considered by us. Our techniques use load balancing taking into account the CPU, network and I/O costs as well as the load on compute and storage nodes. We have implemented our techniques on Hadoop, Spark and the Muppet stream processing engine. Our experiments show that our optimization techniques provide a significant improvement in throughput over existing techniques.
