Similarity join on large-scale high-dimensional data faces major challenges because of the data scale and the curse of dimensionality. Random projection with a p-stable distribution can reduce high-dimensional data from d dimensions to k dimensions (k ≪ d), and the distances between data points in the k-dimensional space can then be used to filter out as many data pairs as possible at relatively low cost. Based on this idea, we propose two novel approaches to similarity join on large-scale high-dimensional data: the projection-based similarity join (PromSimJ) algorithm and the projection space partitioning-based similarity join (ProSPSimJ) algorithm. We performed comprehensive experiments to evaluate both methods and compared them with the naive block nested loop join. The experimental results show that our approaches achieve much better performance and good scalability.
KEYWORDS
high-dimensional data, p-stable distribution, random projection, similarity join
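The filtering idea summarized in the abstract can be illustrated with a minimal sketch. It is not the PromSimJ or ProSPSimJ algorithm; it only assumes the 2-stable (Gaussian) case with Euclidean distance, and the function names, the `slack` parameter, and the sample data are hypothetical choices made for illustration. Pairs whose distance in the k-dimensional projected space is clearly above the threshold are skipped, and the remaining candidates are verified with the exact d-dimensional distance.

```python
import math
import random
from itertools import combinations


def gaussian_projection(d, k, rng):
    """Return a k x d matrix with i.i.d. N(0, 1) entries (the 2-stable case)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]


def project(matrix, v):
    """Map a d-dimensional vector to k dimensions, scaled by 1/sqrt(k)."""
    k = len(matrix)
    return [sum(a * x for a, x in zip(row, v)) / math.sqrt(k) for row in matrix]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def filtered_similarity_join(data, eps, k, slack=1.5, seed=0):
    """Return index pairs (i, j), i < j, whose exact distance is <= eps.

    Pairs whose projected distance exceeds slack * eps are skipped without
    computing the exact distance. This filter is probabilistic (an assumption
    of this sketch): a true pair can be dropped, so `slack` and `k` trade
    recall against filtering power.
    """
    rng = random.Random(seed)
    d = len(data[0])
    matrix = gaussian_projection(d, k, rng)
    low = [project(matrix, v) for v in data]
    result = []
    for i, j in combinations(range(len(data)), 2):
        if euclidean(low[i], low[j]) > slack * eps:
            continue  # cheap k-dimensional filter
        if euclidean(data[i], data[j]) <= eps:
            result.append((i, j))  # exact d-dimensional verification
    return result


# Two identical 50-dimensional vectors and one far-away vector.
data = [[0.0] * 50, [0.0] * 50, [10.0] * 50]
print(filtered_similarity_join(data, eps=1.0, k=8))  # [(0, 1)]
```

Because every reported pair passes the exact distance check, the sketch never returns a false positive; the projection only determines how many exact distance computations are avoided and how many true pairs might be missed.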
Concurrency Computat Pract Exper. 2019;31:e5303. wileyonlinelibrary.com/journal/cpe

INTRODUCTION
A similarity join query (SJQ) aims to find all data pairs whose similarity is no less than a given similarity threshold (or whose distance is no more than a given distance threshold). As one of the active research topics in big data analysis, SJQ has been widely used in many similarity search and data mining applications, such as duplicate web page detection,¹ personalized recommendation,² trajectory clustering,³ image classification,⁴ and so on. Take duplicate web page detection as an example: as the number of web pages grows, duplicate pages appear because of human factors. To detect them, each web page is first converted into a high-dimensional vector; the distance between each pair of vectors is then calculated, and if the distance of a pair is less than the given distance threshold, the two pages are considered duplicates. The distance calculation is time-consuming because of the large number of web pages and the high dimensionality of the web page vectors. Although SJQ has been studied extensively, major challenges remain when processing SJQ on large-scale high-dimensional data. As the dimensionality increases, traditional filtering schemes based on tree-like indexes or space partitioning no longer work well. When the dimensionality exceeds some threshold, the performance of the tree-like index⁵ is perhaps