Similarity join on high-dimensional data is a primitive operation. It finds all pairs of data whose distance is no more than a given threshold, according to a specific distance measure. As the scale and dimensionality of data sets increase, the computational cost grows rapidly. Hadoop and Spark have become popular platforms for big-data analysis. Because Spark has native advantages in iterative computation, we adopted it as our platform for performing similarity joins on high-dimensional data sets. To resolve problems in existing works, such as data imbalance, data duplication, and redundant computation, we propose a new algorithm based on symbolic aggregation and vertical decomposition. We first reduce dimensionality using a symbolic aggregation method, then apply a vertical partition operation to the processed data. The join operations are performed on each vertical partition in parallel, and the proposed new filters prune false positives at an early stage. Finally, the partial results generated by each partition are aggregated and verified to obtain the final results. Our proposed algorithm significantly improves the efficiency of similarity joins on high-dimensional data. To verify the efficiency and scalability of our methods, we implemented them using MapReduce and Spark. We compared our methods with existing works on public data sets, and the experimental results showed that the new methods were more efficient and scalable under different running environments.
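To make the symbolic-aggregation step concrete, the following is a minimal sketch of the standard SAX-style pipeline the abstract alludes to: piecewise aggregate approximation (PAA) compresses a series into segment means, and each mean is then mapped to a symbol using equal-probability breakpoints of the standard normal distribution. The function names, the alphabet size of 4, and the hard-coded breakpoints are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each equal-width segment."""
    n = len(series)
    return np.array([series[i * n // segments:(i + 1) * n // segments].mean()
                     for i in range(segments)])

def sax(series, segments):
    """Map a z-normalized series to a symbol string (alphabet size 4).

    Breakpoints split N(0, 1) into four equal-probability regions; these
    values are the standard ones for cardinality 4 in the SAX literature.
    """
    breakpoints = np.array([-0.6745, 0.0, 0.6745])  # assumed cardinality 4
    alphabet = "abcd"
    reduced = paa(series, segments)
    # searchsorted returns the region index for each PAA mean
    return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in reduced)
```

For example, a z-normalized 4-point series reduced to 2 segments yields a 2-symbol word, so downstream join candidates can be pruned by comparing short strings instead of full high-dimensional vectors.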
KEYWORDS: high-dimensional data, piecewise aggregation, similarity join, symbolic aggregation, Spark, vertical partition
INTRODUCTION
In this era of big data, data acquisition occurs ever more quickly, the scale of data is increasing rapidly, and the types of data are complex and diverse. This brings new challenges to data analysis and processing. As a basic operation, the similarity join has been applied widely in many fields, such as friend recommendations,1 pattern recognition,2 clustering,3 image similarity matching,4 outlier detection,5 and spatial databases.6 A similarity join is essentially a set of pairwise comparisons. It has high computational complexity and is computationally intensive: processing time grows quadratically with the number of data points. To improve the execution efficiency of algorithms, more efficient methods are needed to reduce unnecessary operations in large-scale data processing. Most traditional algorithms use a spatial index, such as a B+tree, R-tree, or z-order curve, to improve the performance of a similarity join, but traditional algorithms do not apply to large-scale, high-dimensional data sets. The current MapReduce7 framework based on Hadoop has emerged as the primary choice for big-data processing. MapReduce is a programming model with which scalable parallel applications for large-scale data can be developed easily. For computationally intensive operations such as a similarity join, scholars recently have proposed parallel knn-join algorithms using MapReduce, such as H-BNLJ, H-BRJ,8 and PGBJ...
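The quadratic cost described above can be seen in a naive nested-loop baseline, sketched below under the assumption of a Euclidean distance measure and a user-supplied threshold `eps` (both names are illustrative). This is the brute-force formulation that the index-based and MapReduce-based algorithms discussed in this paper aim to improve upon.

```python
import numpy as np

def similarity_join(X, eps):
    """Naive nested-loop similarity join.

    Returns all index pairs (i, j), i < j, whose rows in X lie within
    Euclidean distance eps of each other. Every pair is compared once,
    so the running time is O(n^2) in the number of points -- the cost
    that parallel and filter-based approaches try to reduce.
    """
    pairs = []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) <= eps:
                pairs.append((i, j))
    return pairs
```

For instance, on three 2-D points where only the first two are close, the join returns the single pair `(0, 1)`; doubling the number of points roughly quadruples the comparison count, which motivates the pruning filters introduced later.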