Proceedings of the 2021 International Conference on Management of Data 2021
DOI: 10.1145/3448016.3457250

Auto-Validate: Unsupervised Data Validation Using Data-Domain Patterns Inferred from Data Lakes

Abstract: Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently in ways that are expensive to resolve. Data validation has thus become an important topic, as evidenced by not…

Cited by 8 publications (4 citation statements). References 46 publications.
“…In [137], Song et al. have tackled a specific data-cleaning problem, namely data validation. In a large enterprise data lake with terabytes of data, the data may change over time.…”
Section: Data Cleaning By Validation Rule Inferencementioning
confidence: 99%
“…The data validation rules indicate whether the changes are significant enough to affect the downstream applications. The approach in [137] tries to derive such rules automatically from machine-generated, string-valued data, rather than relying on rules inferred by human experts. In principle, it formulates rule inference as an optimization problem that balances minimizing the false-positive rate against preserving the ability to catch quality issues.…”
Section: Data Cleaning By Validation Rule Inferencementioning
confidence: 99%
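To make the rule-inference idea concrete, here is a minimal sketch of deriving a validation pattern from machine-generated strings by generalizing them into character-class tokens. The token classes and function names (`generalize`, `infer_rule`) are illustrative assumptions, not the paper's actual pattern language or its false-positive-rate optimization:

```python
import re
from typing import List, Optional

def _tokenize(value: str):
    """Yield maximal runs of digits, letters, or single other characters."""
    i = 0
    while i < len(value):
        if value[i].isdigit():
            j = i
            while j < len(value) and value[j].isdigit():
                j += 1
            yield "digits", value[i:j]
        elif value[i].isalpha():
            j = i
            while j < len(value) and value[j].isalpha():
                j += 1
            yield "letters", value[i:j]
        else:
            j = i + 1
            yield "other", value[i:j]
        i = j

def generalize(value: str) -> str:
    """Map a string to a regex built from coarse character-class tokens."""
    parts = []
    for kind, chunk in _tokenize(value):
        if kind == "digits":
            parts.append(r"\d+")
        elif kind == "letters":
            parts.append(r"[A-Za-z]+")
        else:
            parts.append(re.escape(chunk))
    return "".join(parts)

def infer_rule(column: List[str]) -> Optional[str]:
    """Accept a pattern as a rule only if it covers every observed value."""
    patterns = {generalize(v) for v in column}
    return patterns.pop() if len(patterns) == 1 else None

# Historical values of an upstream timestamp column
history = ["2021-03-01 12:00:00", "2021-03-02 08:15:42"]
rule = infer_rule(history)
assert rule is not None and re.fullmatch(rule, "2024-11-30 23:59:59")
assert not re.fullmatch(rule, "30/11/2024")  # format drift is flagged
```

In the paper's setting, choosing among candidate generalizations (more or less specific patterns) is where the optimization comes in; this sketch simply takes one fixed level of generalization.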
“…Such tiny unit tests for data validation can be embedded into a workflow and immediately raise a flag if anything unexpected happens, e.g., newly arrived data violates a primitive that was described within the Great Expectations framework. A similar idea is used in the Auto-Validate [32] system.…”
Section: Related Workmentioning
confidence: 99%
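The "tiny unit tests embedded into a workflow" idea can be sketched in a few lines of plain Python. The `check_*` helpers and `run_step` driver below are hypothetical names for illustration, not the Great Expectations API or Auto-Validate's interface:

```python
import re

def check_not_null(rows, column):
    """All rows must have a non-null value in the column."""
    return all(row.get(column) is not None for row in rows)

def check_matches(rows, column, pattern):
    """All values in the column must match the given pattern."""
    rx = re.compile(pattern)
    return all(rx.fullmatch(str(row[column])) for row in rows)

def run_step(rows, transform, checks):
    """Run named checks on newly arrived data before transforming it."""
    for name, check in checks:
        if not check(rows):
            raise ValueError(f"data validation failed: {name}")
    return transform(rows)

batch = [{"id": "u123"}, {"id": "u456"}]
out = run_step(
    batch,
    transform=lambda rows: [r["id"].upper() for r in rows],
    checks=[
        ("id not null", lambda r: check_not_null(r, "id")),
        ("id format", lambda r: check_matches(r, "id", r"u\d+")),
    ],
)
assert out == ["U123", "U456"]
```

A batch that violates a check raises before the transform runs, which is exactly the "fail loudly instead of breaking silently" behavior the citing paper describes.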
“…The goal of string profiling is to learn succinct regular expression patterns that describe a collection of strings. These profiles are useful for a myriad of applications, from checking the quality of data and computing the syntactic similarity between strings, to tagging large datasets with column metadata [Song and He 2021] and making string-transformation synthesizers more robust by improving the ranking of programs [Ellis and Gulwani 2017] or by learning separate programs for examples with different profiles [Padhi et al 2018].…”
Section: Profilingmentioning
confidence: 99%
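One use of string profiles mentioned above, grouping examples so that a synthesizer can learn a separate program per group, can be sketched as follows. The coarse profile syntax here is an assumption for illustration, not the pattern language of any of the cited systems:

```python
import re
from collections import defaultdict

def profile(s):
    """Map a string to a coarse character-class profile."""
    return "".join(
        r"\d+" if tok.isdigit()
        else (r"[A-Za-z]+" if tok.isalpha() else re.escape(tok))
        for tok in re.findall(r"\d+|[A-Za-z]+|.", s)
    )

def cluster_by_profile(strings):
    """Group strings that share the same syntactic profile."""
    groups = defaultdict(list)
    for s in strings:
        groups[profile(s)].append(s)
    return dict(groups)

data = ["A-10", "B-7", "2021-06", "1999-12"]
clusters = cluster_by_profile(data)
# Two clusters: letter-dash-digits ids vs. digit-dash-digit dates,
# each of which could feed a separately synthesized program.
assert len(clusters) == 2
```

Profile equality also gives a crude notion of syntactic similarity between strings: two strings are "similar" if they generalize to the same profile.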