2019
DOI: 10.14778/3342263.3342638

Distributed implementations of dependency discovery algorithms

Abstract: We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce six primitives shared by existing dependency discovery algorithms, corresponding to data processing steps separated b…

Cited by 21 publications (10 citation statements)
References 19 publications

“…In this section, we evaluate DISTOD's performance in different settings and on various datasets. We compare its runtime with all existing complete OD discovery algorithms, which are FASTOD-BID [25] and its distributed variant DIST-FASTOD-BID [22]. Note that ORDER [15], its hybrid variant [12], and OCDDISCOVER [5] produce incomplete results and are, therefore, not comparable to our approach.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…The motivation for this research project is the observation that most distributed data profiling algorithms, including [14,21,22,33], are built on top of dataflow-based distributed computing frameworks, such as Apache Spark [31] or Apache Flink [30]. These frameworks force the discovery algorithms into batch processing, which is an unsuitable paradigm for all known dependency discovery approaches, because they rely on dynamic pruning and dynamic candidate generation techniques.…”
Section: Distributed Discovery Of Order Dependencies
Citation type: mentioning
confidence: 99%
“…An example problem is the discovery of unique column combinations (UCC), which is a set of columns whose projection only contains unique rows. Finding all exact UCCs for a given relation is shown to be NP-hard [8] and various algorithms were introduced in [9,19,22,23]. These algorithms mainly focus on the discovery of exact UCCs, so unclean data might hamper the discovery.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
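
The UCC definition quoted in the last statement (a set of columns whose projection contains only unique rows) can be illustrated with a small single-machine sketch. This is a naive brute-force check written for illustration only; it is not the method of the cited paper or of any algorithm referenced above, and the names `is_ucc`, `minimal_uccs`, and the sample `rows` are hypothetical.

```python
from itertools import combinations


def is_ucc(rows, columns):
    """Return True if projecting `rows` onto `columns` yields no duplicate tuples."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # two rows agree on every column in the combination
        seen.add(key)
    return True


def minimal_uccs(rows, column_names):
    """Enumerate minimal UCCs by increasing size (exponential; illustration only)."""
    found = []
    for size in range(1, len(column_names) + 1):
        for combo in combinations(column_names, size):
            # Any superset of a known UCC is unique too, but not minimal: skip it.
            if any(set(ucc) <= set(combo) for ucc in found):
                continue
            if is_ucc(rows, combo):
                found.append(combo)
    return found


# Hypothetical sample relation.
rows = [
    {"id": 1, "first": "Ann", "last": "Lee"},
    {"id": 2, "first": "Ann", "last": "Kim"},
    {"id": 3, "first": "Bob", "last": "Lee"},
]

print(minimal_uccs(rows, ["id", "first", "last"]))
```

On this sample, `id` alone is unique, and neither `first` nor `last` is unique by itself while the pair is. The candidate space grows exponentially with the number of columns, which is why the discovery algorithms mentioned in the statements above rely on pruning.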