A Multi-way Semi-stream Join for a Near-Real-Time Data Warehouse

Naeem, M. Asif; Nguyen, Kim Tung; Weber, Gerald

doi:10.1007/978-3-319-68155-9_5

Cited by 2 publications

(2 citation statements)

References 10 publications

“…
[6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] | Hadoop | no | Similarity
[17], [18], [19] | Spark | no | Similarity
[20], [21], [22], [23], [24], [25], [26], [27] | N/A | yes | Equi
[28] | N/A | yes | Similarity
[29] | Spark | yes | Equi
DSim-Join | Spark | yes | Similarity
…”

Section: Related Work (mentioning)

confidence: 99%

Semi-Stream Similarity Join Processing in a Distributed Environment

Kim; Lee

IEEE Access, 2020

Similarity join has become very important for processing and analyzing semi-structured or unstructured data. Although several studies have been conducted on the similarity join, little attention has been paid to the semi-stream similarity join, which is a similarity join between stream data and a large disk-based relation. In this study, we propose the first distributed solution, called DSim-Join, for the semi-stream similarity join problem. DSim-Join minimizes data transmission, reduces database accesses using a cache in a distributed stream processing engine, parallelizes join processing, and balances the load between parallel join threads. Experimental results obtained using real-world datasets show that DSim-Join yields significantly improved throughput compared with state-of-the-art methods, especially for large datasets. The results also show that DSim-Join is scalable and stable; it is not very sensitive to parameters such as the micro-batch interval, checkpoint interval, and similarity threshold.

Index terms: semi-stream join, similarity join, distributed stream processing engine, database system, big data, distributed join processing, cache management, Spark Streaming.

“…Mehmood and Naeem [22] used parallel execution between the stream-probing phase and the disk-probing phase by introducing an intermediate buffer between the two phases. Naeem, Nguyen, and Weber [23] proposed multi-way semi-stream join methods. Naeem et al. [24] presented a technique for load shedding in semi-stream join processing.…”

Section: Related Work (mentioning)

confidence: 99%