Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data 2020
DOI: 10.1145/3318464.3389749
A Statistical Perspective on Discovering Functional Dependencies in Noisy Data

Abstract: We study the problem of discovering functional dependencies (FDs) from a noisy data set. We adopt a statistical perspective and draw connections between FD discovery and structure learning in probabilistic graphical models. We show that discovering FDs from a noisy data set is equivalent to learning the structure of a model over binary random variables, where each random variable corresponds to a functional of the data set attributes. We build upon this observation to introduce FDX, a conceptually simple framewo…

Cited by 25 publications (11 citation statements). References 35 publications.
“…To find FDs efficiently, existing approaches can be classified into three categories: (1) Tuple-oriented methods (e.g., FastFDs [29], DepMiner [20]) that exploit the notion of tuples agreeing on the same values to determine the attribute combinations of an FD; (2) Attribute-oriented methods (e.g., Tane [11], [12], Fun [21], [22], FDMine [30]) that use pruning techniques and reduce the search space to the necessary set of attributes of the relation to discover exact and approximate FDs; HyFD [25] exploits the tuple- and attribute-oriented approaches simultaneously to outperform the previous approaches; and, more recently, (3) Structure-learning methods relying on sparse regression [31] or on entropy-based measures [15] to score candidate constraints (not limited to FDs alone). More particularly, FDX [31] performs structure learning over a sample constructed by taking the value differences over sampled pairs of tuples from the raw data.…”
Section: Related Work
confidence: 99%
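The value-difference sampling that the excerpt attributes to FDX can be sketched as follows. This is a minimal illustration of the described idea only, not the authors' implementation; the function name `difference_sample` and its parameters are hypothetical. Each sampled tuple pair yields one binary row whose entries record, per attribute, whether the two tuples disagree (1) or agree (0); an FD A → B then shows up as the pattern "row agrees on A implies row agrees on B".

```python
import random

def difference_sample(rows, n_pairs=1000, seed=0):
    """Sketch of an FDX-style difference sample: draw pairs of tuples
    and record, for each attribute, whether the pair disagrees (1)
    or agrees (0) on it. Illustrative only, not FDX's actual API."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n_pairs):
        t1, t2 = rng.sample(rows, 2)  # two distinct tuples, without replacement
        sample.append([int(a != b) for a, b in zip(t1, t2)])
    return sample
```

On a relation where the first attribute functionally determines the second, no difference row can agree on the first attribute while disagreeing on the second, which is exactly the signal a structure learner over these binary variables can pick up.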
“…In particular, the paper extends the large-sample computational hardness results of Chickering et al. [15] to a setting that is important in economic theory. A number of papers have proposed algorithms for finding Bayesian network structures despite the computational hardness of the problem (Caravagna and Ramazzotti [16], Constantinou et al. [17], Malone et al. [18], Platas-Lopez et al. [19], Talvitie et al. [20], Zhang et al. [21]; see Scarnagatta et al. [22] for a survey of the older literature). Unlike these papers, the focus here is on showing that hardness affects a large class of learning problems in economic settings, and that this can lead to counterintuitive results in applied areas like finance.…”
Section: Literature Review
confidence: 99%
“…Existing methods identify minimal FDs from a single table. However, minimality does not guarantee that the set of discovered FDs will be parsimonious [14,28], and minimal FDs from single tables are not necessarily minimal multi-table FDs. In this paper, we propose the first approach that efficiently discovers join FDs from multiple tables of a database.…”
Section: Figure 1: Illustrative Example
confidence: 99%
“…To find FDs efficiently, existing approaches can be classified into three categories: (1) Tuple-oriented methods (e.g., FastFDs [26], DepMiner [16]) that exploit the notion of tuples agreeing on the same values to determine the attribute combinations of an FD; (2) Attribute-oriented methods (e.g., Tane [8,9], Fun [17,18], FDMine [27]) that use pruning techniques and reduce the search space to the necessary set of attributes of the relation to discover exact and approximate FDs; HyFD [22] exploits the tuple- and attribute-oriented approaches simultaneously to outperform the previous approaches; and, more recently, (3) Structure-learning methods relying on sparse regression [28] or on entropy-based measures [12] to score candidate constraints (not limited to FDs alone). More particularly, FDX [28] performs structure learning over a sample constructed by taking the value differences over sampled pairs of tuples from the raw data.…”
Section: Related Work
confidence: 99%
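The "tuples agreeing on the same values" notion behind tuple-oriented methods such as FastFDs and DepMiner can be illustrated with a small agree-set computation. This is a sketch of the general idea only; `agree_sets` is a hypothetical helper, not code from any of the cited tools. Each tuple pair contributes the set of attributes on which the two tuples take identical values, and FD candidates are then derived from these sets.

```python
from itertools import combinations

def agree_sets(rows, attrs):
    """For every pair of tuples, collect the set of attributes on which
    the pair takes identical values. Tuple-oriented FD discoverers
    derive candidate dependencies from these sets. Illustrative sketch."""
    result = set()
    for t1, t2 in combinations(rows, 2):
        agreeing = frozenset(
            a for a, (x, y) in zip(attrs, zip(t1, t2)) if x == y
        )
        result.add(agreeing)
    return result
```

For instance, on rows ("a", 1), ("a", 2), ("b", 1) over attributes A and B, the three pairs agree on {A}, on {B}, and on nothing, respectively.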