Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics

Rong, Chuitian; Lin, Chunbin; Silva, Yasin N.; Wang, Jianguo; Lü, Wei; Du, Xiaoyong

doi:10.1109/icde.2017.151

Cited by 36 publications

(15 citation statements)

References 15 publications

“…The process is performed in a single MapReduce job. FS-Join [129] sorts the tokens in each set in increasing order of frequency, and then splits each set into disjoint subsets using appropriate pivot tokens. These subsets are then grouped together so that subsets from different groups are non-overlapping.…”

Section: Distributed Algorithms (mentioning)

confidence: 99%

Blocking and Filtering Techniques for Entity Resolution

et al. 2020


Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that correspond to the same real-world object. Due to its inherently quadratic complexity, a series of techniques accelerate it so that it scales to voluminous data. In this survey, we review a large number of relevant works under two different but related frameworks: Blocking and Filtering. The former restricts comparisons to entity pairs that are more likely to match, while the latter quickly identifies entity pairs that are likely to satisfy predetermined similarity thresholds. We also elaborate on hybrid approaches that combine different characteristics. For each framework we provide a comprehensive list of the relevant works, discussing them in the greater context. We conclude with the most promising directions for future work in the field.
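The FS-Join description quoted in the citation statement above (order tokens by increasing frequency, then cut each set into disjoint segments at pivot tokens) can be illustrated with a small single-machine sketch. This is not the paper's implementation: the helper names (`global_order`, `choose_pivots`, `split_set`, `overlap_by_segments`) and the evenly spaced pivot choice are assumptions made for illustration only.

```python
from bisect import bisect_right
from collections import Counter


def global_order(sets):
    """Rank tokens by increasing frequency (ties broken by the token itself)."""
    freq = Counter(tok for s in sets for tok in set(s))
    ordered = sorted(freq, key=lambda t: (freq[t], t))
    return {tok: rank for rank, tok in enumerate(ordered)}


def choose_pivots(rank, num_parts):
    """Assumed pivot choice: evenly spaced positions in the global order."""
    n = len(rank)
    return [n * i // num_parts for i in range(1, num_parts)]


def split_set(s, rank, pivots):
    """Split one set into disjoint segments, one per pivot interval."""
    segments = [[] for _ in range(len(pivots) + 1)]
    for tok in sorted(set(s), key=rank.__getitem__):
        segments[bisect_right(pivots, rank[tok])].append(tok)
    return segments


def overlap_by_segments(a_segments, b_segments):
    """Total overlap of two sets, summed segment by segment."""
    return sum(len(set(x) & set(y)) for x, y in zip(a_segments, b_segments))
```

Because segment `i` of every set only holds tokens from the same rank interval, segments with different indices never share a token, so the overlap of two sets is the sum of their per-segment overlaps. That property is what would let each segment group be processed independently in a distributed setting.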

“…As MapReduce [15] engines such as Hadoop started becoming used for ETL workloads, several new scale out fuzzy join techniques for MapReduce were proposed such as [18, 8,33,27,31,13,16,17,30]. Most of these techniques retained the overall approach used in single-node techniques: signature-based identification of candidate pairs followed by a verification step.…”

Section: Related Work (mentioning)

confidence: 99%

“…Once the reference table R becomes large, a single-node solution is no longer feasible. Therefore, scale-out approaches including [18,39,8,33,27,31,13,16,17,30] have been developed on MapReduce [15] engines such as Hadoop. A recent experimental study by Fier et al [19] compared several scale-out techniques.…”

Section: Introduction (mentioning)

confidence: 99%

Customizable and scalable fuzzy join for big data

et al. 2019


Fuzzy join is an important primitive for data cleaning. The ability to customize fuzzy join is crucial to allow applications to address domain-specific data quality issues such as synonyms and abbreviations. While efficient indexing techniques exist for single-node implementations of customizable fuzzy join, the state-of-the-art scale-out techniques do not support customization, and exhibit poor performance and scalability characteristics. We describe the design of a scale-out fuzzy join operator that supports customization. We use a locality-sensitive-hashing (LSH) based signature scheme, and introduce optimizations that result in significant speedup with negligible impact on recall. We evaluate our implementation on the Azure Databricks version of Spark using several real-world and synthetic data sets. We observe speedups exceeding 50X compared to the best-known prior scale-out technique, and close to linear scalability with data size and number of nodes.
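The two-phase pattern mentioned in the citation statements above (signature-based candidate generation followed by verification) can be sketched with a generic MinHash LSH signature. This is a minimal single-machine illustration under stated assumptions, not the operator described in the abstract: the function names, the banding parameters, and the use of Jaccard similarity are all choices made here for the example, and records are assumed to have non-empty token sets.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Lower-cased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def minhash(toks, num_hashes):
    """MinHash signature: for each seed, the smallest hash over the tokens."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in toks
        ))
    return tuple(sig)


def candidate_pairs(records, num_hashes=32, bands=16):
    """Phase 1: records sharing any signature band become candidate pairs."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for rid, toks in records.items():
        sig = minhash(toks, num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(rid)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    return len(a & b) / len(a | b)


def fuzzy_join(records, threshold):
    """Phase 2: keep only candidates whose exact similarity passes the threshold."""
    return sorted(
        (x, y)
        for x, y in candidate_pairs(records)
        if jaccard(records[x], records[y]) >= threshold
    )
```

Verification guarantees that no reported pair falls below the threshold, while the LSH phase may miss some true matches; that trade-off is the recall cost the abstract refers to.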


“…Some of these contributions focus on performing join operations using distributed and parallel platforms. Several works [22][23][24][25] have explored the efficient way to perform set similarity joins. The parallel theta join using MapReduce to join two data sets like in relational databases is explored by Okcan and Riedewald [26] and by Zhang et al. [26,27]. Similarity joins on high dimensional data using Spark are studied exploiting data representation and vertical partition techniques.…”

Section: Related Work (mentioning)

confidence: 99%

Parallel time series join using spark

Rong, Chen, Silva 2019

Concurrency and Computation

Self Cite


Summary: A time series is a sequence of data points in successive temporal order. Time series data is produced in many application scenarios, and the techniques for its analysis have generated substantial interest. Time series join is a primitive operation that retrieves all pairs of correlated subsequences from two given time series. As the Pearson correlation coefficient, a measure of the correlation between two variables, has multiple beneficial mathematical properties, for example, the fact that it is invariant with respect to scale and offset, it is used to measure the correlation between two time series. Considering the need to analyze big time series data, we focus on the study of scalable and distributed techniques to process massive data sets. Specifically, we propose a parallel approach to perform time series joins using Spark, a popular analytics engine for large‐scale data processing. Our solution builds on (1) a fast method to compute the fast Fourier transform on the time series to calculate the correlation between two time series, (2) a lossless partition method to divide the time series into multiple subsequences and enable a parallel and correct computation of the join result, and (3) optimization techniques to avoid redundant computations. We performed extensive tests and showed that the proposed approach is efficient and scalable across different data sets and test configurations.
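The idea of using the fast Fourier transform to obtain correlations, mentioned in point (1) of the summary above, can be illustrated with the standard convolution-theorem technique: the dot product of a query with every window of a longer series is one FFT-based convolution, and the Pearson correlation follows from those dot products plus running means and variances. This is a plain-Python sketch under stated assumptions, not the paper's Spark implementation; `fft`, `sliding_dot`, and `pearson_profile` are names introduced here, and windows are assumed to have non-zero variance.

```python
import cmath
import math


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unnormalized)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def sliding_dot(q, t):
    """Dot product of q with every length-len(q) window of t, via FFT."""
    m, n = len(q), len(t)
    size = 1
    while size < n + m:
        size *= 2
    fq = fft([complex(x) for x in reversed(q)] + [0j] * (size - m))
    ft = fft([complex(x) for x in t] + [0j] * (size - n))
    prod = fft([x * y for x, y in zip(fq, ft)], invert=True)
    # Dividing by size normalizes the inverse transform.
    return [prod[k].real / size for k in range(m - 1, n)]


def pearson_profile(q, t):
    """Pearson correlation between q and each window of t of the same length."""
    m = len(q)
    mean_q = sum(q) / m
    std_q = math.sqrt(sum(x * x for x in q) / m - mean_q * mean_q)
    # Prefix sums give each window's mean and variance in constant time.
    ps, ps2 = [0.0], [0.0]
    for x in t:
        ps.append(ps[-1] + x)
        ps2.append(ps2[-1] + x * x)
    profile = []
    for s, dot in enumerate(sliding_dot(q, t)):
        mean_w = (ps[s + m] - ps[s]) / m
        var_w = max((ps2[s + m] - ps2[s]) / m - mean_w * mean_w, 0.0)
        profile.append((dot / m - mean_q * mean_w) / (std_q * math.sqrt(var_w)))
    return profile
```

A window that matches the query up to scale and offset yields a correlation of 1, which reflects the invariance property the summary mentions.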
