2017 IEEE 33rd International Conference on Data Engineering (ICDE) 2017
DOI: 10.1109/icde.2017.151

Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics

Cited by 36 publications (15 citation statements)
References 15 publications

“…The process is performed in a single MapReduce job. FS-Join [129] sorts the tokens in each set in increasing order of frequency, and then splits each set into disjoint subsets using appropriate pivot tokens. These subsets are then grouped together so that subsets from different groups are non-overlapping.…”
Section: Distributed Algorithms (mentioning)
confidence: 99%
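
The partitioning described in the statement above can be illustrated with a minimal sketch. It is not the FS-Join implementation: the function names, the tie-breaking rule, the sample data, and the choice of pivot are all assumptions made for illustration, and the MapReduce distribution is left out. It only shows tokens ordered by increasing frequency, each set split at pivot tokens into disjoint segments, and segments grouped by index so that different groups share no tokens.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def global_order(sets):
    """Rank tokens by increasing frequency (ties broken by token value, an assumption)."""
    freq = Counter(tok for s in sets for tok in set(s))
    ordered = sorted(freq, key=lambda t: (freq[t], t))
    return {tok: rank for rank, tok in enumerate(ordered)}


def split_by_pivots(tokens, rank, pivot_ranks):
    """Split one set into disjoint segments delimited by the sorted pivot ranks."""
    segments = [[] for _ in range(len(pivot_ranks) + 1)]
    for tok in sorted(set(tokens), key=rank.__getitem__):
        # Tokens ranked below the first pivot go to segment 0, and so on.
        segments[bisect_right(pivot_ranks, rank[tok])].append(tok)
    return segments


# Hypothetical input data.
sets = [["a", "b", "c", "d"], ["b", "c", "d"], ["c", "d", "e"]]
rank = global_order(sets)
pivots = [rank["b"]]  # hypothetical pivot choice: a single pivot at token "b"

segments = [split_by_pivots(s, rank, pivots) for s in sets]

# Group segments by segment index; groups with different indexes share no tokens.
groups = defaultdict(list)
for set_id, segs in enumerate(segments):
    for idx, seg in enumerate(segs):
        if seg:
            groups[idx].append((set_id, seg))
```

Because the segments of a set are disjoint, the overlap of two sets equals the sum of the overlaps of their same-index segments, which is why each group can be processed on its own.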
“…As MapReduce [15] engines such as Hadoop started becoming used for ETL workloads, several new scale out fuzzy join techniques for MapReduce were proposed such as [18, 8,33,27,31,13,16,17,30]. Most of these techniques retained the overall approach used in single-node techniques: signature-based identification of candidate pairs followed by a verification step.…”
Section: Related Work (mentioning)
confidence: 99%
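
The two-phase pattern named in the statement above (signatures to find candidate pairs, then verification) can be sketched on a single node as follows. The statement does not say which signature scheme each cited technique uses; prefix filtering with Jaccard similarity is assumed here as one common choice, and the function names, the lexicographic token order, and the sample records are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity, used in the verification step."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a or b else 1.0


def prefix(tokens, threshold):
    """Signature: the first |s| - ceil(t * |s|) + 1 tokens under a fixed global order."""
    toks = sorted(set(tokens))  # lexicographic order assumed as the global order
    k = len(toks) - math.ceil(threshold * len(toks)) + 1
    return toks[:k]


def similarity_join(records, threshold):
    # Phase 1: records that share a prefix token become candidate pairs.
    index = defaultdict(set)
    for rid, tokens in records.items():
        for tok in prefix(tokens, threshold):
            index[tok].add(rid)
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(sorted(rids), 2))
    # Phase 2: verify each candidate with the exact similarity.
    return sorted(
        (a, b) for a, b in candidates
        if jaccard(records[a], records[b]) >= threshold
    )


# Hypothetical records keyed by id.
records = {1: ["a", "b", "c", "d"], 2: ["a", "b", "c", "e"], 3: ["x", "y", "z"]}
result = similarity_join(records, 0.5)
```

Record 3 shares no prefix token with the others, so it never reaches verification; only the pair (1, 2) is compared exactly.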
“…Once the reference table R becomes large, a single-node solution is no longer feasible. Therefore, scale-out approaches including [18,39,8,33,27,31,13,16,17,30] have been developed on MapReduce [15] engines such as Hadoop. A recent experimental study by Fier et al [19] compared several scale-out techniques.…”
Section: Introduction (mentioning)
confidence: 99%
“…Some of these contributions focus on performing join operations using distributed and parallel platforms. Several works [22][23][24][25] have explored the efficient way to perform set similarity joins. The parallel theta join using MapReduce to join two data sets like in relational databases is explored by Okcan and Riedewald [26] and by Zhang et al. [26,27]. Similarity joins on high dimensional data using Spark are studied exploiting data representation and vertical partition techniques.…”
Section: Related Work (mentioning)
confidence: 99%