Optimization for Large-Scale Fuzzy Joins Using Fuzzy Filters in MapReduce

Tran, Thi-To-Quyen; Phan, Thuong-Cang; Laurent, Anne; d’Orazio, Laurent

doi:10.1109/fuzz48607.2020.9177610

Cited by 5 publications

(1 citation statement)

References 40 publications

(42 reference statements)

“…Through experimental results, they found that fuzzy join that uses locality-sensitive-hashing signature is significantly faster than a prefix filtering based technique and in case the broadcast fuzzy join is applicable, it is faster than the shuffle version. Tran Thi-To-Quyen et al [35] proposed to integrate the Bloom filter in fuzzy joins to support fast similarity queries in reducing redundant data. The approach was done by maintaining a bit matrix, with a small false positive rate, and zero false negative rate.…”
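
The statement above attributes two properties to the filter: a small false positive rate and no false negatives. A minimal sketch of a plain, exact-membership Bloom filter shows where both come from. This is not the fuzzy filter of the cited paper, and the names `BloomFilter`, `might_contain`, and `filtered_join_candidates`, as well as the sizes used, are hypothetical choices for illustration only.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over string keys (illustrative, not the paper's fuzzy filter)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position, to keep the sketch easy to read.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True for every added key (no false negatives);
        # may also be True for a key that was never added (false positive).
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_join_candidates(left_keys, right_keys):
    """Keep only right-side keys that may have a partner on the left side."""
    bloom = BloomFilter()
    for key in left_keys:
        bloom.add(key)
    return [key for key in right_keys if bloom.might_contain(key)]
```

Bits are only ever set, never cleared, so a key that was added always passes the check. A key that was not added passes only if all of its positions happen to have been set by other keys, which is the source of the false positive rate.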

mentioning

confidence: 99%

Similarity Algorithms for Fuzzy Join Computation in Big Data Processing Environment

Phan 2022

JCC

Big data processing is attracting the interest of many researchers who need to process large-scale datasets and extract useful information to support decision making. One of the biggest challenges is querying large datasets, and it becomes even more complicated with similarity queries instead of exact-match queries. A fuzzy join is a typical operation frequently used in similarity queries and big data analysis. There is currently very little research on this issue, which poses significant barriers to improving the efficiency of query operations on big data. This study therefore surveys similarity algorithms for fuzzy joins, in which the data at the join key attributes may differ slightly, within a fuzzy threshold. We analyze six similarity algorithms (Hamming, Levenshtein, LCS, Jaccard, Jaro, and Jaro-Winkler) and compare them on three criteria: output enrichment, false positives/negatives, and processing time. The fuzzy join experiments are implemented in Spark, a popular big data processing platform. The algorithms are divided into two groups for evaluation: group 1 (Hamming, Levenshtein, and LCS) and group 2 (Jaccard, Jaro, and Jaro-Winkler). In the former, Levenshtein has an advantage over the other two algorithms in terms of output enrichment, accuracy of the result set (false positives/negatives), and acceptable processing time. In the latter, Jaccard is the worst algorithm on all three criteria, while Jaro-Winkler gives richer output and higher accuracy in the result set. This overview of similarity algorithms will help users choose the most suitable algorithm for their problems.
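
To make the idea of a join "within a fuzzy threshold" concrete, here is a minimal sketch of two of the six measures named in the abstract, Levenshtein and Jaccard, together with a naive nested-loop fuzzy join. It assumes plain Python rather than Spark, and computing Jaccard over character bigrams is an assumption of this sketch; the abstract does not say how the study tokenizes keys. The function names and the `max_dist` parameter are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bigrams(text):
    """Set of character bigrams (assumed tokenization for this sketch)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Jaccard similarity of the two bigram sets, in [0, 1]."""
    sa, sb = bigrams(a), bigrams(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def fuzzy_join(left, right, max_dist=1):
    """Naive fuzzy join: pair keys whose edit distance is within the threshold."""
    return [(l, r) for l in left for r in right if levenshtein(l, r) <= max_dist]


if __name__ == "__main__":
    print(fuzzy_join(["smith", "jones"], ["smyth", "johns"]))
```

The nested loop compares every pair, which is why large-scale fuzzy joins rely on filtering or hashing to prune candidates before the similarity measure is applied.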

Neural Networks Based Fuzzy Join Algorithm In Big Data Processing

Phan, Tran, Phan et al. 2024

Communications in Computer and Information Science

LSH SimilarityJoin Pattern in FastFlow

Tonci, Rivault, Bamha et al. 2024

Int J Parallel Prog

Similarity joins are recognized to be among the most used data processing and analysis operations. We introduce a C++-based high-level parallel pattern implemented on top of FastFlow Building Blocks to provide the programmer with ready-to-use similarity join computations. The SimilarityJoin pattern is implemented according to the MapReduce paradigm enriched with locality sensitive hashing (LSH) to optimize the whole computation. The new parallel pattern can be used with any C++ serializable data structure and executed on shared- and distributed-memory machines. We present experimental validations of the proposed solution considering two different clusters and small and large input datasets to evaluate in-core and out-of-core executions. The performance assessment of the SimilarityJoin pattern has been conducted by comparing the execution time against the one obtained from the original hand-tuned Hadoop-based implementation of the LSH-based similarity join algorithms as well as a Spark-based version. The experiments show that the SimilarityJoin pattern: (1) offers a significant performance improvement for small and medium datasets; (2) is competitive also for computations using large input datasets producing out-of-core executions.
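
The combination of MapReduce and locality sensitive hashing described in the abstract can be illustrated with a small MinHash banding sketch. It is written in Python for brevity, whereas the pattern itself is C++ on FastFlow; it does not use or imitate the FastFlow API, and every name and parameter here (`shingles`, `minhash_signature`, `lsh_candidate_pairs`, `num_hashes`, `bands`) is hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Set of character k-grams; short texts yield a single shingle."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(seed, item):
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=16):
    """One minimum per seeded hash function over the shingle set."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]


def lsh_candidate_pairs(records, num_hashes=16, bands=8):
    """records: dict of id -> text. Returns candidate id pairs sharing a band bucket."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    # "Map" step: emit (band index, band signature) -> record id.
    for rid, text in records.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(rid)
    # "Reduce" step: records that land in the same bucket become candidates.
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Candidate pairs still need to be verified with an exact similarity measure, since banding can group records that fall below the intended threshold and can miss some that are above it.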
