Customizable and scalable fuzzy join for big data

Chen, Zhimin; Wang, Yue; Narasayya, Vivek; Chaudhuri, Surajit

doi:10.14778/3352063.3352128

Cited by 9 publications

(5 citation statements)

References 20 publications


“…Essentially, aside from the matching component, table union search needs to address the additional problem of identifying matching candidates among many non-matching ones, which is a significantly more challenging setup. Similarly, fuzzy join [16], [17] assumes a restrictive setup with a pair of input datasets. In their experiments, the second dataset in the pair is usually a syntactically perturbed variant of the first dataset and thus cannot mimic the complexity of data lakes consisting of heterogeneous datasets across domains.…”

Section: Definition 3 (Table Union-ability) Following Notations In

mentioning

confidence: 99%

Pylon: Semantic Table Union Search in Data Lakes

Cong, Nargesian, Jagadish

2023

Preprint


The large size and fast growth of data repositories, such as data lakes, have spurred the need for data discovery to help analysts find related data. The problem has become challenging as (i) a user typically does not know what datasets exist in an enormous data repository; and (ii) there is usually a lack of a unified data model to capture the interrelationships between heterogeneous datasets from disparate sources. In this work, we address one important class of discovery needs: finding unionable tables. The task is to find tables in a data lake that can be unioned with a given query table. The challenge is to recognize unionable columns even if they are represented differently. In this paper, we propose a data-driven learning approach: specifically, an unsupervised representation learning and embedding retrieval task. Our key idea is to exploit self-supervised contrastive learning to learn an embedding model that takes into account the indexing/search data structure and produces embeddings close by for columns with semantically similar values while pushing apart columns with semantically dissimilar values. We then find unionable tables based on similarities between their constituent columns in embedding space. On a real-world data lake, we demonstrate that our best-performing model achieves significant improvements in precision (16% ↑), recall (17% ↑), and query response time (7x faster) compared to the state-of-the-art.

“…Fuzzy Joining. A fuzzy join identifies data point pairs that are similar to one another across two database tables, where similar records may be identified with respect to a similarity function (e.g., cosine similarity, Jaccard similarity, edit distance) and threshold [15,59] defined over a subset of key columns. Though related to entity matching, fuzzy joins can be viewed as a primitive used to efficiently mine and block pairs that are similar across the two data sources.…”

Section: Motivating Applications

mentioning

confidence: 99%
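The fuzzy join described in the statement above (a similarity function plus a threshold over key columns) can be illustrated with a minimal sketch. This is not code from the cited paper or from Ember: the function names, the whitespace tokenizer, the sample records, and the 0.6 threshold are all assumptions chosen for illustration, and the nested loop is the naive quadratic form that scalable operators are designed to avoid.

```python
def tokens(text):
    """Lowercase, whitespace-delimited token set (an assumed tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def fuzzy_join(left, right, key, threshold):
    """Return all (left_row, right_row) pairs whose key columns have
    Jaccard similarity at or above the threshold. Naive O(n*m) scan."""
    return [
        (l, r)
        for l in left
        for r in right
        if jaccard(l[key], r[key]) >= threshold
    ]


# Hypothetical sample tables, loosely modeled on a title column.
left = [{"title": "The Matrix"}, {"title": "Blade Runner"}]
right = [
    {"title": "the matrix (1999)"},
    {"title": "Blade Runner 2049"},
    {"title": "Alien"},
]

pairs = fuzzy_join(left, right, key="title", threshold=0.6)
for l, r in pairs:
    print(l["title"], "<->", r["title"])
```

Raising the threshold toward 1.0 moves the join toward exact matching on token sets, while lowering it admits more (and noisier) pairs.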

“…We build two workloads using a dataset and generation procedure from a 2019 scalable fuzzy join VLDB paper [15]. The first dataset consists of the Title, Year, and Genre columns from IMDb [5].…”

Section: Fuzzy Join (Fj)

mentioning

confidence: 99%


Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins

Suri, Ilyas, Ré et al.

2021

Preprint


Structured data, or data that adheres to a pre-defined schema, can suffer from fragmented context: information describing a single entity can be scattered across multiple datasets or tables tailored for specific business needs, with no explicit linking keys (e.g., primary key-foreign key relationships or heuristic functions). Context enrichment, or rebuilding fragmented context, using keyless joins is an implicit or explicit step in machine learning (ML) pipelines over structured data sources. This process is tedious, domain-specific, and lacks support in now-prevalent no-code ML systems that let users create ML pipelines using just input data and high-level configuration files. In response, we propose Ember, a system that abstracts and automates keyless joins to generalize context enrichment. Our key insight is that Ember can enable a general keyless join operator by constructing an index populated with task-specific embeddings. Ember learns these embeddings by leveraging Transformer-based representation learning techniques. We describe our core architectural principles and operators when developing Ember, and empirically demonstrate that Ember allows users to develop no-code pipelines for five domains, including search, recommendation and question answering, and can exceed alternatives by up to 39% recall, with as little as a single-line configuration change.


“…For relatively small datasets, running on multi-core systems provides good scalability and avoids the overhead of network communication. Zhimin Chen et al. [13] described a scale-out fuzzy join operator that supports customization with a locality-sensitive-hashing based signature scheme. The evaluation of the design was done on the Azure Databricks version of Spark using several real-world and synthetic datasets.…”

mentioning

confidence: 99%
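To give a rough sense of what a locality-sensitive-hashing signature scheme does in a fuzzy join, the sketch below uses MinHash with banding to generate candidate pairs. It is a generic textbook technique swapped in for illustration, not the signature scheme of the cited paper; the 3-character shingles, the MD5-based hash family, and the parameter values are all assumptions.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of k-character substrings (the whole string if shorter than k)."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, num_hashes=16):
    """MinHash signature: the minimum hash of the shingle set per seed."""
    sh = shingles(text)
    return tuple(min(seeded_hash(seed, t) for t in sh) for seed in range(num_hashes))


def band_keys(signature, rows_per_band=2):
    """Split a signature into bands; each band is one blocking key."""
    return [
        (start, signature[start:start + rows_per_band])
        for start in range(0, len(signature), rows_per_band)
    ]


def candidate_pairs(left, right, num_hashes=16, rows_per_band=2):
    """Index pairs (i, j) whose strings share at least one band."""
    buckets = defaultdict(list)
    for j, text in enumerate(right):
        for key in band_keys(minhash(text, num_hashes), rows_per_band):
            buckets[key].append(j)
    pairs = set()
    for i, text in enumerate(left):
        for key in band_keys(minhash(text, num_hashes), rows_per_band):
            for j in buckets.get(key, ()):
                pairs.add((i, j))
    return pairs
```

Only the candidate pairs would then be checked with the exact similarity function, which is what lets the join avoid comparing every record with every other. Because the band keys act as ordinary equi-join keys, this step can be distributed by hashing on them.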

Similarity Algorithms for Fuzzy Join Computation in Big Data Processing Environment

Phan

2022

JCC


Big data processing is attracting the interest of many researchers to process large-scale datasets and extract useful information for supporting and providing decisions. One of the biggest challenges is the problem of querying large datasets. It becomes even more complicated with similarity queries instead of exact match queries. A fuzzy join operation is a typical operation frequently used in similarity queries and big data analysis. Currently, there is very little research on this issue, thus it poses significant barriers to the efforts of improving query operations on big data efficiently. As a result, this study overviews the similarity algorithms for fuzzy joins, in which the data at the join key attributes may have slight differences within a fuzzy threshold. We analyze six similarity algorithms including Hamming, Levenshtein, LCS, Jaccard, Jaro, and Jaro-Winkler, to show the difference between these algorithms through the three criteria: output enrichment, false positives/negatives, and the processing time of the algorithms. Experiments of fuzzy join algorithms are implemented in the Spark environment, a popular big data processing platform. The algorithms are divided into two groups for evaluation: group 1 (Hamming, Levenshtein, and LCS) and group 2 (Jaccard, Jaro, and Jaro-Winkler). For the former, Levenshtein has an advantage over the other two algorithms in terms of output enrichment, high accuracy in the result set (false positives/negatives), and acceptable processing time. In the latter, Jaccard is considered the worst algorithm considering all three criteria, while the Jaro-Winkler algorithm has more output richness and higher accuracy in the result set. The overview of the similarity algorithms in this study will help users to choose the most suitable algorithm for their problems.
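Two of the group 1 measures named in the abstract above, Hamming and Levenshtein, can be sketched in a few lines. This is a plain single-machine illustration under assumed function names, not the Spark implementation evaluated in the study; the `within_threshold` helper and its `max_edits` parameter are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def within_threshold(a, b, max_edits):
    """Fuzzy-match predicate: join keys match if at most max_edits apart."""
    return levenshtein(a, b) <= max_edits
```

Hamming is cheaper but only defined for keys of equal length, whereas Levenshtein also tolerates inserted or dropped characters, which is one plausible reason it yields richer join output on keys of varying length.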
