Similarity join has become very important for processing and analyzing semi-structured and unstructured data. Although several studies have been conducted on the similarity join, little attention has been paid to the semi-stream similarity join, which is a similarity join between stream data and a large disk-based relation. In this study, we propose DSim-Join, the first distributed solution for the semi-stream similarity join problem. DSim-Join minimizes data transmission, reduces database accesses using a cache in a distributed stream processing engine, parallelizes join processing, and balances the load between parallel join threads. Experimental results obtained using real-world datasets show that DSim-Join yields significantly improved throughput compared with state-of-the-art methods, especially for large datasets. The results also show that DSim-Join is scalable and stable; it is not very sensitive to parameters such as the micro-batch interval, checkpoint interval, and similarity threshold.
INDEX TERMS semi-stream join, similarity join, distributed stream processing engine, database system, big data, distributed join processing, cache management, Spark Streaming.
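To make the problem setting concrete, the following is a minimal single-process sketch of a semi-stream similarity join that uses a cache to reduce lookups against the stored relation. It is not the DSim-Join algorithm: the Jaccard similarity measure, the token-based inverted index, the LRU eviction policy, and all names (`LRUCache`, `build_index`, `join_batch`) are assumptions chosen for illustration, and the "disk-based relation" is simulated with an in-memory dictionary. Distribution, parallel join threads, and load balancing are omitted.

```python
from collections import OrderedDict


def jaccard(a, b):
    """Jaccard similarity of two token collections (assumed measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class LRUCache:
    """Hypothetical cache in front of the relation; counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.data[key]
        self.misses += 1
        value = loader(key)  # stands in for a database access
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used
        return value


def build_index(relation):
    """Inverted index: token -> ids of relation records containing it."""
    index = {}
    for rid, tokens in relation.items():
        for token in tokens:
            index.setdefault(token, []).append(rid)
    return index


def join_batch(batch, relation, index, cache, threshold):
    """Join one micro-batch of stream records against the relation."""

    def load(token):
        # Simulated disk read: fetch every record that contains the token.
        return [(rid, relation[rid]) for rid in index.get(token, [])]

    results = []
    for sid, tokens in batch:
        seen = set()
        for token in sorted(tokens):
            for rid, rtokens in cache.get(token, load):
                if rid in seen:
                    continue  # candidate already verified for this record
                seen.add(rid)
                if jaccard(tokens, rtokens) >= threshold:
                    results.append((sid, rid))
    return results


# Toy data (made up for illustration only).
relation = {1: {"a", "b", "c"}, 2: {"a", "b", "d"}, 3: {"x", "y"}}
index = build_index(relation)
cache = LRUCache(capacity=4)
batch = [("s1", {"a", "b", "c"}), ("s2", {"a", "b"})]
pairs = join_batch(batch, relation, index, cache, threshold=0.6)
```

In this toy run the second stream record reuses postings that the first record already pulled into the cache, which is the effect a cache is meant to have on database accesses; `cache.hits` and `cache.misses` can be inspected to see it.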