Data Profiling

Abedjan, Ziawasch; Golab, Lukasz; Naumann, Felix; Papenbrock, Thorsten

doi:10.2200/s00878ed1v01y201810dtm052

Cited by 24 publications

(60 citation statements)

References 188 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Traditionally, data dependencies stem from data modeling and schema design, e.g., 3NF synthesis, or BCNF decomposition, but data profiling identifies data dependencies from the data themselves, independently of such processes. Because the discovery of data dependencies (of any type) is NP-complete [5] and sometimes even W[2]-to W[3]-complete [35], mining data dependencies is challenging. For this reason, we also provide pointers to the most recent automatic discovery and maintenance algorithms, which in practice are sufficiently fast to be useful in the context of query optimization on realworld datasets.…”

Section: Data Dependenciesmentioning

confidence: 99%

“…For this reason, we also provide pointers to the most recent automatic discovery and maintenance algorithms, which in practice are sufficiently fast to be useful in the context of query optimization on realworld datasets. An introductory overview of data profiling techniques can be found in [5], while a comprehensive survey is given in [3].…”

Section: Data Dependenciesmentioning

confidence: 99%

“…(iv) UCCs can occur by chance, 2 especially for column combinations containing many columns. Discovery and maintenance of UCCs Unique column combinations, especially minimal ones, are neither obvious nor simple to determine, as UCC discovery is a problem in O(2 m ) for datasets with m attributes [5,77]. Therefore, efficient, automatic UCC discovery algorithms exist, which can serve these unique column combinations to a query optimizer.…”

Section: Origin Of Uccsmentioning

confidence: 99%

See 2 more Smart Citations

Data dependencies for query optimization: a survey

2021

Self Cite

View full text Add to dashboard Cite

Effective query optimization is a core feature of any database management system. While most query optimization techniques make use of simple metadata, such as cardinalities and other basic statistics, other optimization techniques are based on more advanced metadata including data dependencies, such as functional, uniqueness, order, or inclusion dependencies. This survey provides an overview, intuitive descriptions, and classifications of query optimization and execution strategies that are enabled by data dependencies. We consider the most popular types of data dependencies and focus on optimization strategies that target the optimization of relational database queries. The survey supports database vendors to identify optimization opportunities as well as DBMS researchers to find related work and open research questions.

show abstract

Section: Data Dependenciesmentioning

confidence: 99%

Section: Data Dependenciesmentioning

confidence: 99%

Section: Origin Of Uccsmentioning

confidence: 99%

See 1 more Smart Citation

Data dependencies for query optimization: a survey

2021

Self Cite

View full text Add to dashboard Cite

show abstract

“…• Task (5) was not implemented at all, as data were stored locally in OS files. As a consequence, data were not shared and access to data was non-optimal.…”

Section: Motivation: Real Casesmentioning

confidence: 99%

“…These metadata is used to select a type of technology and not a specific DBMS. -Data characteristics |4|: describing the set of data characteristics that are required to automate the transformation from the logical to the physical model, obtained by traditional data profiling techniques [5]. -Rules for data cleaning and deduplication |5|: describing domain rules for data cleaning and removing duplicates.…”

Section: Challenge 2: Metadata Managementmentioning

confidence: 99%

Data Engineering for Data Science: Two Sides of the Same Coin

Romero

Wrembel

2020

Big Data Analytics and Knowledge Discovery

View full text Add to dashboard Cite

A de facto technological standard of data science is based on notebooks (e.g., Jupyter), which provide an integrated environment to execute data workflows in different languages. However, from a data engineering point of view, this approach is typically inefficient and unsafe, as most of the data science languages process data locally, i.e., in workstations with limited memory, and store data in files. Thus, this approach neglects the benefits brought by over 40 years of R&D in the area of data engineering, i.e., advanced database technologies and data management techniques. In this paper, we advocate for a standardized data engineering approach for data science and we present a layered architecture for a data processing pipeline (DPP). This architecture provides a comprehensive conceptual view of DPPs, which next enables the semi-automation of the logical and physical designs of such DPPs.

show abstract