Distributed implementations of dependency discovery algorithms

Saxena, Hemant; Golab, Lukasz; Ilyas, Ihab F.

doi:10.14778/3342263.3342638

Cited by 21 publications

(10 citation statements)

References 19 publications

“…In this section, we evaluate DISTOD's performance in different settings and on various datasets. We compare its runtime with all existing complete OD discovery algorithms, which are FASTOD-BID [25] and its distributed variant DIST-FASTOD-BID [22]. Note that ORDER [15], its hybrid variant [12], and OCDDISCOVER [5] produce incomplete results and are, therefore, not comparable to our approach.…”

Section: Discussion (mentioning)

confidence: 99%

“…The motivation for this research project is the observation that most distributed data profiling algorithms, including [14,21,22,33], are built on top of dataflow-based distributed computing frameworks, such as Apache Spark [31] or Apache Flink [30]. These frameworks force the discovery algorithms into batch processing, which is an unsuitable paradigm for all known dependency discovery approaches, because they rely on dynamic pruning and dynamic candidate generation techniques.…”

Section: Distributed Discovery Of Order Dependencies (mentioning)

confidence: 99%

“…Distributed order dependency discovery: Because FASTOD-BID is the only complete and correct OD algorithm, not much research exists on distributed OD discovery. In [22], Saxena et al. proposed common map-reduce style primitives (based on Apache Spark) into which they could break down any existing data profiling algorithm. In this way, they presented distributed versions of different dependency discovery algorithms including FASTOD-BID; we call this implementation DIST-FASTOD-BID.…”

Section: Related Work (mentioning)

confidence: 99%

“…This has the advantage that we can discard the attribute type information and all concrete values, which saves memory and, because the computations effectively deal with integers only, makes the operations on partitions fast and simple. Stripped partitions: DISTOD requires only the sorted partitions for each {A} ∈ R (level l_1 of the candidate lattice); for attribute sets with |X| > 1 (higher levels), it replaces sorted partitions with the smaller stripped partitions [5,10,12,22,25,26] (also known as position list indexes [2,15,19,20]).…”

Section: To Give An Example Consider Table 1 and The Partition (mentioning)

confidence: 99%
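
The stripped partitions mentioned in the statement above can be illustrated with a minimal sketch. It is not taken from DISTOD or any of the cited implementations; the function names `stripped_partition` and `refine` and the sample columns are hypothetical. The sketch groups row indices by value, drops groups of size one, and refines a partition by a second column.

```python
from collections import defaultdict


def stripped_partition(column):
    """Group row indices by value and drop singleton groups (hypothetical helper)."""
    groups = defaultdict(list)
    for row_id, value in enumerate(column):
        groups[value].append(row_id)
    # Only equivalence classes with at least two rows are kept.
    return [ids for ids in groups.values() if len(ids) > 1]


def refine(partition, column):
    """Split each class of a stripped partition by the values of another column."""
    result = []
    for cls in partition:
        sub = defaultdict(list)
        for row_id in cls:
            sub[column[row_id]].append(row_id)
        result.extend(ids for ids in sub.values() if len(ids) > 1)
    return result


# Illustrative data (assumed, not from the source).
city = ["Berlin", "Paris", "Berlin", "Rome", "Paris", "Berlin"]
year = [2019, 2020, 2019, 2021, 2021, 2020]

p_city = stripped_partition(city)      # classes for {city}
p_city_year = refine(p_city, year)     # classes for {city, year}
print(p_city, p_city_year)
```

Because only integer row indices are stored, the attribute types and concrete values are no longer needed once the partitions have been built.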

Efficient distributed discovery of bidirectional order dependencies

Schmidl

Papenbrock

2021

The VLDB Journal

Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending order. The knowledge about order relationships is useful for many data management tasks, such as query optimization, data cleaning, or consistency checking. Because the bODs of a specific dataset are usually not explicitly given, they need to be discovered. The discovery of all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes, though, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in a reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded and distributed state-of-the-art bOD discovery algorithms by up to orders of magnitude; it can, in particular, process much larger datasets.
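
The book example in the abstract can be made concrete with a small validity check. This is a sketch under stated assumptions, not the discovery algorithm of the paper: it only tests one given candidate on one table, the function name `holds_bidirectional_od` and the sample rows are hypothetical, and the right-hand attribute is assumed to be numeric.

```python
def holds_bidirectional_od(rows, lhs, rhs, rhs_descending=False):
    """Check whether sorting by `lhs` ascending also sorts by `rhs`
    (ascending, or descending if `rhs_descending` is set)."""
    sign = -1 if rhs_descending else 1
    ordered = sorted(rows, key=lambda r: r[lhs])
    for prev, curr in zip(ordered, ordered[1:]):
        a, b = sign * prev[rhs], sign * curr[rhs]
        if prev[lhs] == curr[lhs]:
            if a != b:
                return False  # equal lhs values but different rhs values
        elif a > b:
            return False      # rhs order contradicts lhs order
    return True


# Illustrative data (assumed, not from the source).
books = [
    {"title": "A", "published": 2001, "age": 24},
    {"title": "B", "published": 1995, "age": 30},
    {"title": "C", "published": 2010, "age": 15},
    {"title": "D", "published": 2001, "age": 24},
]

print(holds_bidirectional_od(books, "published", "age", rhs_descending=True))
print(holds_bidirectional_od(books, "published", "age"))
```

Checking adjacent rows after sorting is sufficient here, because equality within ties and monotonicity across ties both carry over transitively to all row pairs.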

“…An example problem is the discovery of unique column combinations (UCC), which is a set of columns whose projection only contains unique rows. Finding all exact UCCs for a given relation is shown to be NP-hard [8] and various algorithms were introduced in [9,19,22,23]. These algorithms mainly focus on the discovery of exact UCCs, so unclean data might hamper the discovery.…”

Section: Related Work (mentioning)

confidence: 99%
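
The UCC definition quoted above (a set of columns whose projection contains only unique rows) can be sketched as a brute-force check. This is an illustration only and none of the algorithms cited in the statement: it enumerates column combinations level by level, which is exponential in the number of columns. The names `is_ucc` and `minimal_uccs` and the sample rows are hypothetical.

```python
from itertools import combinations


def is_ucc(rows, columns):
    """Return True if the projection of `rows` onto `columns` has no duplicate."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_uccs(rows, all_columns):
    """Enumerate minimal UCCs by increasing size, skipping supersets of known UCCs."""
    found = []
    for size in range(1, len(all_columns) + 1):
        for combo in combinations(all_columns, size):
            if any(set(u) <= set(combo) for u in found):
                continue  # a subset is already unique, so combo is not minimal
            if is_ucc(rows, combo):
                found.append(combo)
    return found


# Illustrative data (assumed, not from the source).
people = [
    {"first": "Ann", "last": "Lee", "zip": "10115"},
    {"first": "Ann", "last": "Kim", "zip": "10117"},
    {"first": "Bob", "last": "Lee", "zip": "10117"},
]

uccs = minimal_uccs(people, ["first", "last", "zip"])
print(uccs)
```

A single duplicated value is enough to make `is_ucc` fail, which is the sensitivity to unclean data that the statement points out.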

PatchIndex: exploiting approximate constraints in distributed databases

Kläbe

Sattler

Baumann

2021

Distrib Parallel Databases

Cloud data warehouse systems lower the barrier to access data analytics. These applications often lack a database administrator and integrate data from various sources, potentially leading to data not satisfying strict constraints. Automatic schema optimization in self-managing databases is difficult in these environments without prior data cleaning steps. In this paper, we focus on constraint discovery as a subtask of schema optimization. Perfect constraints might not exist in these unclean datasets due to a small set of values violating the constraints. Therefore, we introduce the concept of a generic PatchIndex structure, which handles exceptions to given constraints and enables database systems to define these approximate constraints. We apply the concept to the environment of distributed databases, providing parallel index creation approaches and optimization techniques for parallel queries using PatchIndexes. Furthermore, we describe heuristics for automatic discovery of PatchIndex candidate columns and prove the performance benefit of using PatchIndexes in our evaluation.
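
The idea of recording exceptions to a constraint, as described in the abstract, can be illustrated for a nearly unique column. The sketch below is one possible reading and not the PatchIndex structure from the paper: the names `uniqueness_patches` and `count_distinct` and the sample data are hypothetical. It stores the row ids that violate uniqueness and treats the remaining rows as satisfying the constraint.

```python
from collections import Counter


def uniqueness_patches(column):
    """Return the row ids whose value occurs more than once (the exceptions)."""
    counts = Counter(column)
    return {i for i, value in enumerate(column) if counts[value] > 1}


def count_distinct(column, patches):
    """Count distinct values, deduplicating only the exception rows."""
    clean_rows = len(column) - len(patches)          # unique by construction
    patched_values = {column[i] for i in patches}    # need real deduplication
    return clean_rows + len(patched_values)


# Illustrative data (assumed, not from the source).
emails = ["a@x.org", "b@x.org", "a@x.org", "c@x.org", "d@x.org"]

patches = uniqueness_patches(emails)
print(patches, count_distinct(emails, patches))
```

When the exception set is small, most rows skip deduplication entirely, which hints at why an approximate constraint can still be useful to a query engine.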

GDS: General Distributed Strategy for Functional Dependency Discovery Algorithms

Yang

Wang

et al.

2020

Database Systems for Advanced Applications
