Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data 2020
DOI: 10.1145/3318464.3389749
A Statistical Perspective on Discovering Functional Dependencies in Noisy Data

Abstract: We study the problem of discovering functional dependencies (FDs) from a noisy data set. We adopt a statistical perspective and draw connections between FD discovery and structure learning in probabilistic graphical models. We show that discovering FDs from a noisy data set is equivalent to learning the structure of a model over binary random variables, where each random variable corresponds to a functional of the data set attributes. We build upon this observation to introduce FDX, a conceptually simple framewo…

Cited by 25 publications (11 citation statements). References 35 publications.
“…To find FDs efficiently, existing approaches can be classified into three categories: (1) Tuple-oriented methods (e.g., FastFDs [29], DepMiner [20]) that exploit the notion of tuples agreeing on the same values to determine the attribute combinations of an FD; (2) Attribute-oriented methods (e.g., Tane [11], [12], Fun [21], [22], FDMine [30]) that use pruning techniques and reduce the search space to the necessary set of attributes of the relation to discover exact and approximate FDs; HyFD [25] exploits the tuple- and attribute-oriented approaches simultaneously to outperform the previous approaches; and, more recently, (3) Structure-learning methods relying on sparse regression [31] or on entropy-based measures [15] to score candidate constraints (not limited to FDs alone). More particularly, FDX [31] performs structure learning over a sample constructed by taking the value differences over sampled pairs of tuples from the raw data.…”
Section: Related Work
confidence: 99%
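The value-difference sampling that the excerpt attributes to FDX can be sketched as follows. This is a minimal illustration of the described idea only, not the authors' implementation; the function name `difference_sample` and its parameters are hypothetical. Each sampled tuple pair yields one binary row whose entries record, per attribute, whether the two tuples disagree (1) or agree (0); an FD A → B then shows up as the pattern "row agrees on A implies row agrees on B".

```python
import random

def difference_sample(rows, n_pairs=1000, seed=0):
    """Sketch of an FDX-style difference sample: draw pairs of tuples
    and record, for each attribute, whether the pair disagrees (1)
    or agrees (0) on it. Illustrative only, not FDX's actual API."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n_pairs):
        t1, t2 = rng.sample(rows, 2)  # two distinct tuples, without replacement
        sample.append([int(a != b) for a, b in zip(t1, t2)])
    return sample
```

On a relation where the first attribute functionally determines the second, no difference row can agree on the first attribute while disagreeing on the second, which is exactly the signal a structure learner over these binary variables can pick up.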
“…In particular, the paper extends the large-sample computational hardness results of Chickering et al. [15] to a setting that is important in economic theory. A number of papers have proposed algorithms for finding Bayesian network structures despite the computational hardness of the problem (Caravagna and Ramazzotti [16], Constantinou et al. [17], Malone et al. [18], Platas-Lopez et al. [19], Talvitie et al. [20], Zhang et al. [21]; see Scarnagatta et al. [22] for a survey of the older literature). Unlike these papers, the focus here is on showing that hardness affects a large class of learning problems in economic settings, and that this can lead to counterintuitive results in applied areas like finance.…”
Section: Literature Review
confidence: 99%
“…Existing methods identify minimal FDs from a single table. However, minimality does not guarantee that the set of discovered FDs will be parsimonious [14,28], and minimal FDs from single tables are not necessarily minimal multi-table FDs. In this paper, we propose the first approach that efficiently discovers join FDs from multiple tables of a database.…”
Section: Figure 1: Illustrative Example
confidence: 99%
“…To find FDs efficiently, existing approaches can be classified into three categories: (1) Tuple-oriented methods (e.g., FastFDs [26], DepMiner [16]) that exploit the notion of tuples agreeing on the same values to determine the attribute combinations of an FD; (2) Attribute-oriented methods (e.g., Tane [8,9], Fun [17,18], FDMine [27]) that use pruning techniques and reduce the search space to the necessary set of attributes of the relation to discover exact and approximate FDs; HyFD [22] exploits the tuple- and attribute-oriented approaches simultaneously to outperform the previous approaches; and, more recently, (3) Structure-learning methods relying on sparse regression [28] or on entropy-based measures [12] to score candidate constraints (not limited to FDs alone). More particularly, FDX [28] performs structure learning over a sample constructed by taking the value differences over sampled pairs of tuples from the raw data.…”
Section: Related Work
confidence: 99%
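The "tuples agreeing on the same values" notion behind tuple-oriented methods such as FastFDs and DepMiner can be illustrated with a small agree-set computation. This is a sketch of the general idea only; `agree_sets` is a hypothetical helper, not code from any of the cited tools. Each tuple pair contributes the set of attributes on which the two tuples take identical values, and FD candidates are then derived from these sets.

```python
from itertools import combinations

def agree_sets(rows, attrs):
    """For every pair of tuples, collect the set of attributes on which
    the pair takes identical values. Tuple-oriented FD discoverers
    derive candidate dependencies from these sets. Illustrative sketch."""
    result = set()
    for t1, t2 in combinations(rows, 2):
        agreeing = frozenset(
            a for a, (x, y) in zip(attrs, zip(t1, t2)) if x == y
        )
        result.add(agreeing)
    return result
```

For instance, on rows ("a", 1), ("a", 2), ("b", 1) over attributes A and B, the three pairs agree on {A}, on {B}, and on nothing, respectively.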