Similarity join on high-dimensional data is a primitive operation. It finds all pairs of data whose distance is no more than a given threshold, according to a specific distance measure. As the scale and dimensionality of data sets increase, the computational cost grows rapidly. Hadoop and Spark have become popular platforms for big-data analysis. Because Spark has native advantages in iterative computation, we adopted it as our platform for performing similarity joins on high-dimensional data sets. To resolve problems in existing works, such as data imbalance, data duplication, and redundant computation, we propose a new algorithm based on symbolic aggregation and vertical decomposition. We first reduce dimensionality using a symbolic aggregation method, then apply a vertical partition operation to the processed data. The join operations are performed on each vertical partition in parallel, and the proposed new filters prune false positives at an early stage. Finally, the partial results generated by each partition are aggregated and verified to obtain the final results. Our proposed algorithm significantly improves the efficiency of similarity joins on high-dimensional data. To verify the efficiency and scalability of our methods, we implemented them using MapReduce and Spark. We compared our methods with existing works on public data sets, and the experimental results showed that the new methods were more efficient and scalable under different running environments.
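To make the symbolic-aggregation step concrete, the following is a minimal sketch of the standard SAX-style pipeline the abstract alludes to: piecewise aggregate approximation (PAA) compresses a series into segment means, and each mean is then mapped to a symbol using equal-probability breakpoints of the standard normal distribution. The function names, the alphabet size of 4, and the hard-coded breakpoints are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    n = len(series)
    return np.array([series[i * n // segments:(i + 1) * n // segments].mean()
                     for i in range(segments)])

def sax(series, segments):
    """Map a z-normalized series to a symbol string (alphabet size 4).

    Breakpoints split N(0, 1) into four equal-probability regions; these
    values are the standard ones for cardinality 4 in the SAX literature.
    """
    breakpoints = np.array([-0.6745, 0.0, 0.6745])  # assumed cardinality 4
    alphabet = "abcd"
    reduced = paa(series, segments)
    # searchsorted returns the region index for each PAA mean
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in reduced)
```

For example, a z-normalized 4-point series reduced to 2 segments yields a 2-symbol word, so downstream join candidates can be pruned by comparing short strings instead of full high-dimensional vectors.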
KEYWORDS: high-dimensional data, piecewise aggregation, similarity join, symbolic aggregation, Spark, vertical partition
INTRODUCTION
In this era of big data, data acquisition occurs ever more quickly, the scale of data is increasing rapidly, and the types of data are complex and diverse. This brings new challenges to data analysis and processing. As a basic operation, the similarity join has been applied widely in many fields, such as friend recommendations,1 pattern recognition,2 clustering,3 image similarity matching,4 outlier detection,5 and spatial databases.6 A similarity join is essentially a set of pairwise comparisons. It has high computational complexity and is computationally intensive: processing time grows quadratically with the number of data points. To improve the execution efficiency of algorithms, more efficient methods are needed to reduce unnecessary operations in large-scale data processing. Most traditional algorithms use a spatial index, such as a B+tree, R-tree, or z-order curve, to improve the performance of a similarity join, but traditional algorithms do not apply to large-scale, high-dimensional data sets. The current MapReduce7 framework based on Hadoop has emerged as the primary choice for big-data processing. MapReduce is a programming model with which scalable parallel applications for large-scale data can be developed easily. For computationally intensive operations such as a similarity join, scholars recently have proposed parallel knn-join algorithms using MapReduce, such as H-BNLJ, H-BRJ,8 and PGBJ...
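The quadratic cost described above can be seen in a naive nested-loop baseline, sketched below under the assumption of a Euclidean distance measure and a user-supplied threshold `eps` (both names are illustrative). This is the brute-force formulation that the index-based and MapReduce-based algorithms discussed in this paper aim to improve upon.

```python
import numpy as np

def similarity_join(X, eps):
    """Naive nested-loop similarity join.

    Returns all index pairs (i, j), i < j, whose rows in X lie within
    Euclidean distance eps of each other. Every pair is compared once,
    so the running time is O(n^2) in the number of points -- the cost
    that parallel and filter-based approaches try to reduce.
    """
    pairs = []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) <= eps:
                pairs.append((i, j))
    return pairs
```

For instance, on three 2-D points where only the first two are close, the join returns the single pair `(0, 1)`; doubling the number of points roughly quadruples the comparison count, which motivates the pruning filters introduced later.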