2021
DOI: 10.48550/arxiv.2106.01501
Preprint

Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins

Sahaana Suri,
Ihab F. Ilyas,
Christopher Ré
et al.

Abstract: Structured data, or data that adheres to a pre-defined schema, can suffer from fragmented context: information describing a single entity can be scattered across multiple datasets or tables tailored for specific business needs, with no explicit linking keys (e.g., primary key-foreign key relationships or heuristic functions). Context enrichment, or rebuilding fragmented context, using keyless joins is an implicit or explicit step in machine learning (ML) pipelines over structured data sources. This process is …
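The abstract's notion of a keyless join — linking rows across tables that share no primary/foreign key — can be illustrated with a toy sketch. This is not Ember's actual method (the paper builds on learned similarity, not edit distance), and all table contents and field names below are invented for illustration:

```python
# Hypothetical sketch of a similarity-based keyless join: pair each left row
# with its most lexically similar right row, since no shared key exists.
# Real systems like Ember use learned embeddings instead of string similarity.
from difflib import SequenceMatcher

def keyless_join(left, right, left_key, right_key, threshold=0.5):
    """Join each left row to the most similar right row above a threshold."""
    joined = []
    for l in left:
        best_row, best_score = None, threshold
        for r in right:
            score = SequenceMatcher(
                None, l[left_key].lower(), r[right_key].lower()
            ).ratio()
            if score > best_score:
                best_row, best_score = r, score
        if best_row is not None:
            joined.append({**l, **best_row, "similarity": round(best_score, 2)})
    return joined

# Two fragments of context about the same entities, with no linking key.
products = [{"desc": "Apple iPhone 13 Pro 128GB"},
            {"desc": "Samsung Galaxy S21"}]
listings = [{"title": "iPhone 13 Pro (128 GB)", "price": 999},
            {"title": "Galaxy S21 5G", "price": 799}]

result = keyless_join(products, listings, "desc", "title")
```

The nested loop makes this quadratic in table size; the point of similarity-based approaches is to replace both the hand-written similarity function and the exhaustive comparison with learned representations and approximate nearest-neighbor search.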

Cited by 3 publications (5 citation statements)
References 36 publications
“…Techniques for data integration [9], [10], [11], [36], [39], [40], [41], [42] generally aim to automatically discover, select and aggregate related data in order to extend a given dataset. Many of the approaches deal with tabular data.…”
Section: Data Integration (mentioning)
confidence: 99%
“…Neural style transfer [6], generative modeling techniques such as variational autoencoders (VAEs) [7] and generative adversarial networks (GANs) [8] have also been extensively used to generate synthetic data for training deep learning models. Another way to augment training data is by integrating existing data from several sources (e.g., in [9], [10], [11]). This is a useful way to leverage the large quantities of data available in various forms on the internet and other sources.…”
(mentioning)
confidence: 99%
“…While larger language models have significantly increased the accuracy on that task, they also enable entirely new applications. Here, the tutorial will cover recent research leveraging language models for tasks such as data preparation and integration [2,74,75], fact checking from data [10, 25, 33-40, 81, 82], or database tuning [78-80, 85-87].…”
Section: Applications In Data Management (mentioning)
confidence: 99%
“…Specifically, the tutorial will cover novel ways of representing data using language models (e.g., by storing data as natural language facts [77] or by integrating data within the language model [26]). Also, it will discuss the use of language models in the execution engine (e.g., to implement operators [74,77] or to synthesize code for data processing [84]).…”
Section: Applications In Data Management (mentioning)
confidence: 99%
“…CodexDB relates to prior work exploiting machine learning [6,7,13] and specifically Transformers [20,21] in the context of database systems. It connects broadly to prior work using GPT-3 for program synthesis [5,11,12].…”
Section: Background and Related Work (mentioning)
confidence: 99%