2019
DOI: 10.14778/3352063.3352128

Customizable and scalable fuzzy join for big data

Abstract: Fuzzy join is an important primitive for data cleaning. The ability to customize fuzzy join is crucial to allow applications to address domain-specific data quality issues such as synonyms and abbreviations. While efficient indexing techniques exist for single-node implementations of customizable fuzzy join, the state-of-the-art scale-out techniques do not support customization, and exhibit poor performance and scalability characteristics. We describe the design of a scale-out fuzzy join operator that supports …

Citations: cited by 9 publications (5 citation statements)
References: 20 publications
“…Essentially, aside from the matching component, table union search needs to address the additional problem of identifying matching candidates among many non-matching ones, which is a significantly more challenging setup. Similarly, fuzzy join [16], [17] assumes a restrictive setup with a pair of input datasets. In their experiments, the second dataset in the pair is usually a syntactically perturbed variant of the first dataset and thus cannot mimic the complexity of data lakes consisting of heterogeneous datasets across domains.…”
Section: Definition 3 (Table Union-ability) Following Notations In
mentioning
confidence: 99%
“…Fuzzy Joining. A fuzzy join identifies data point pairs that are similar to one another across two database tables, where similar records may be identified with respect to a similarity function (e.g., cosine similarity, Jaccard similarity, edit distance) and threshold [15,59] defined over a subset of key columns. Though related to entity matching, fuzzy joins can be viewed as a primitive used to efficiently mine and block pairs that are similar across the two data sources.…”
Section: Motivating Applications
mentioning
confidence: 99%
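The definition quoted above (a similarity function plus a threshold over key columns) can be illustrated with a minimal sketch. It assumes Jaccard similarity over lowercase whitespace tokens and a naive nested-loop comparison; the function names `tokens`, `jaccard`, and `fuzzy_join` are hypothetical and are not taken from any of the cited papers.

```python
def tokens(text):
    """Split a key value into a set of lowercase whitespace tokens (an assumed tokenization)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_join(left, right, threshold):
    """Return (left_key, right_key, similarity) for every pair at or above the threshold.

    This compares all pairs, so it is quadratic in the input sizes; it only
    illustrates the join semantics, not an efficient or scale-out algorithm.
    """
    right_tokens = [(r, tokens(r)) for r in right]
    matches = []
    for l in left:
        l_tokens = tokens(l)
        for r, r_tokens in right_tokens:
            sim = jaccard(l_tokens, r_tokens)
            if sim >= threshold:
                matches.append((l, r, sim))
    return matches


if __name__ == "__main__":
    left_titles = ["The Dark Knight", "Toy Story"]
    right_titles = ["dark knight the", "Toy Story 2", "Alien"]
    for row in fuzzy_join(left_titles, right_titles, threshold=0.6):
        print(row)
```

Swapping `jaccard` for cosine similarity or an edit-distance-based score changes the notion of "similar" without changing the join structure, which is why the threshold and the similarity function are usually stated together.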
“…We build two workloads using a dataset and generation procedure from a 2019 scalable fuzzy join VLDB paper [15]. The first dataset consists of the Title, Year, and Genre columns from IMDb [5].…”
Section: Fuzzy Join (Fj)
mentioning
confidence: 99%
“…For relatively small datasets, running on multi-core systems provides good scalability and avoids the overhead of network communication. Zhimin Chen et al. [13] described a scale-out fuzzy join operator that supports customization with a locality-sensitive-hashing based signature scheme. The evaluation of the design was done on the Azure Databricks version of Spark using several real-world and synthetic datasets.…”
mentioning
confidence: 99%
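The statement above mentions a locality-sensitive-hashing based signature scheme but does not describe it. As a rough illustration of the general idea only, the sketch below uses generic MinHash signatures with banding to generate candidate pairs; it is an assumption for illustration, not the scheme from the cited paper, and the names `minhash_signature` and `candidate_pairs` and the parameters `num_hashes` and `bands` are hypothetical.

```python
import hashlib
from collections import defaultdict


def minhash_signature(token_set, num_hashes):
    """MinHash signature: for each seeded hash function, keep the minimum hash over the tokens."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{tok}".encode(), digest_size=8).digest(), "big"
            )
            for tok in token_set
        ))
    return tuple(signature)


def candidate_pairs(left, right, num_hashes=16, bands=8):
    """Return (left, right) pairs whose signatures agree on at least one band.

    Records are tokenized on lowercase whitespace (an assumed tokenization).
    Candidates still need to be verified with an exact similarity check.
    """
    rows = num_hashes // bands
    buckets = defaultdict(lambda: ([], []))  # band key -> (left records, right records)
    for side, records in enumerate((left, right)):
        for rec in records:
            toks = set(rec.lower().split())
            if not toks:
                continue  # an empty record has no signature
            sig = minhash_signature(toks, num_hashes)
            for b in range(bands):
                key = (b, sig[b * rows:(b + 1) * rows])
                buckets[key][side].append(rec)
    pairs = set()
    for left_recs, right_recs in buckets.values():
        for l in left_recs:
            for r in right_recs:
                pairs.add((l, r))
    return pairs


if __name__ == "__main__":
    print(candidate_pairs(["The Dark Knight", "Toy Story"],
                          ["dark knight the", "Toy Story 2", "Alien"]))
```

In a scale-out setting the band key would typically serve as the shuffle key, so that only records sharing a bucket are compared on the same worker; more bands with fewer rows each raise recall at the cost of more candidates.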