Data sources for data integration often come with spurious schema definitions such as undefined foreign key constraints. Such metadata are important for querying the database and for database integration. We present our algorithm SPIDER (Single Pass Inclusion DEpendency Recognition) for detecting inclusion dependencies (INDs), as these are the automatically testable part of a foreign key constraint. For IND detection all pairs of attributes must be tested. SPIDER solves this task very efficiently by testing all attribute pairs in parallel. It analyzes a 2 GB database in ∼20 min and a 21 GB database in ∼4 h.

Schema Discovery for Data Integration

In large integration projects one is often confronted with undocumented data sources. One important kind of schema information is foreign key constraints, which are necessary for meaningful querying and for database integration.

One example is the popular life science database Protein Data Bank (PDB), which can be imported into a relational schema using the OpenMMS schema and parser (openmms.sdsc.edu). The OpenMMS schema defines 175 tables with 2,705 attributes but not a single foreign key constraint.

If foreign key constraints are not defined explicitly, they are given implicitly by the data. Thus, we want to identify inclusion dependencies (INDs) in a schema, as they are a precondition for foreign keys. An inclusion dependency A ⊆ B means that all values of the dependent attribute A are contained in the value set of the referenced attribute B. We call a pair of attributes A and B an IND candidate. An IND is satisfied if the IND requirements are met and unsatisfied otherwise. Obviously, a satisfied IND is only the automatically testable part of a foreign key. Whether a particular IND corresponds to a semantically correct foreign key must be decided in a second step.

The challenge in detecting INDs is the potentially large number of IND candidates, as each pair of attributes must be tested. With n attributes this results in (n − 1)² IND candidates. To the best of our knowledge, all previous approaches for unary IND detection restrict IND candidates by the attributes' data types as proposed in [5], i.e., IND candidates are created only when both attributes share the same data type. This restriction reduces the problem complexity heavily, but in life science databases we cannot use it, because one cannot rely on data types.

In this paper we present our algorithm SPIDER to detect unary INDs. SPIDER works in two phases: (i) all attribute value sets are sorted inside an RDBMS, and (ii) all IND candidates are tested in parallel while reading each attribute's value at most once. Its strength is the data structure that synchronizes all IND candidate tests efficiently without running into deadlocks or missing an IND.

There are two approaches in related work on detecting unary [...] Afterwards all IND candidates are tested in parallel exploiting the sets of attribute names. We shall show in Sec. 3 that SPIDER outperforms both approaches up to orders of magnitude.

In [2] we tested the perfor...
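The two phases described above (sort every attribute's value set, then test all IND candidates together while reading each value at most once) can be illustrated with a small sketch. This is not the authors' implementation: the heap-based merge, the in-memory sorting (the text sorts inside the RDBMS), the string conversion of values, and the function name `unary_inds` are all assumptions made for illustration.

```python
import heapq


def unary_inds(columns):
    """Return all pairs (dependent, referenced) with dependent ⊆ referenced.

    `columns` maps an attribute name to an iterable of its values.
    """
    # Phase 1 (assumption: sorted in memory instead of inside an RDBMS).
    # Values are compared as strings because data types cannot be relied on.
    sorted_vals = {a: sorted({str(v) for v in vs}) for a, vs in columns.items()}

    # Every attribute starts as a possible reference for every other one.
    # An attribute without values keeps all candidates (empty set is a subset).
    refs = {a: set(columns) - {a} for a in columns}

    # Phase 2: one cursor per attribute, merged through a min-heap of
    # (current value, attribute name, cursor position).
    heap = [(vals[0], a, 0) for a, vals in sorted_vals.items() if vals]
    heapq.heapify(heap)

    while heap:
        value = heap[0][0]
        group = []
        # Collect every attribute whose current value equals the minimum.
        while heap and heap[0][0] == value:
            _, attr, pos = heapq.heappop(heap)
            group.append((attr, pos))
        holders = {attr for attr, _ in group}
        for attr, pos in group:
            # Only attributes that also contain `value` can still be referenced.
            refs[attr] &= holders
            if pos + 1 < len(sorted_vals[attr]):
                heapq.heappush(heap, (sorted_vals[attr][pos + 1], attr, pos + 1))

    return {(dep, ref) for dep, candidates in refs.items() for ref in candidates}


if __name__ == "__main__":
    # Hypothetical toy schema, not taken from the paper.
    example = {
        "orders.customer_id": [1, 2, 2, 3],
        "customer.id": [1, 2, 3, 4],
        "customer.name": ["a", "b"],
    }
    print(unary_inds(example))
```

Each sorted value list is advanced exactly once, and no data type filter is applied to the candidates, which matches the setting described for life science databases.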
Large data integration projects must often cope with undocumented data sources. Schema discovery aims at automatically finding structures in such cases. An important class of relationships between attributes that can be detected automatically is inclusion dependencies (INDs), which provide an excellent basis for guessing foreign key constraints. INDs can be discovered by comparing the sets of distinct values of pairs of attributes.

In this paper we present efficient algorithms for finding unary INDs. We first show that (and why) SQL is not suitable for this task. We then develop two algorithms that compute inclusion dependencies outside of the database. Both are much faster than the SQL-based methods; in fact, for larger schemas they are the only feasible solution. Our experiments show that we can compute all unary INDs in a schema of 1,680 attributes with a total database size of 3.2 GB in approximately 2.5 hours.
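To make the SQL-based baseline concrete, the following sketch tests a single IND candidate with one query. The abstract does not give the statements the authors evaluated, so the `NOT EXISTS` formulation, the use of SQLite, and the helper name `ind_holds_sql` are assumptions chosen for illustration only.

```python
import sqlite3


def ind_holds_sql(conn, dep_table, dep_col, ref_table, ref_col):
    """True if every non-NULL value of dep_table.dep_col occurs in ref_table.ref_col."""
    # Identifiers are interpolated directly, so they must come from a
    # trusted schema listing, never from user input.
    query = (
        f'SELECT COUNT(*) FROM "{dep_table}" AS d '
        f'WHERE d."{dep_col}" IS NOT NULL AND NOT EXISTS ('
        f'SELECT 1 FROM "{ref_table}" AS r WHERE r."{ref_col}" = d."{dep_col}")'
    )
    (violations,) = conn.execute(query).fetchone()
    return violations == 0


if __name__ == "__main__":
    # Hypothetical toy tables, not taken from the paper.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER);
        CREATE TABLE orders (customer_id INTEGER);
        INSERT INTO customer VALUES (1), (2), (3);
        INSERT INTO orders VALUES (1), (3), (3), (NULL);
        """
    )
    print(ind_holds_sql(conn, "orders", "customer_id", "customer", "id"))
    print(ind_holds_sql(conn, "customer", "id", "orders", "customer_id"))
```

In this formulation every candidate pair needs its own statement, so the number of queries grows with the number of attribute pairs in the schema.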
Data dependencies are used to improve the quality of a database schema, to optimize queries, and to ensure consistency in a database. Conditional dependencies have been introduced to analyze and improve data quality. A conditional dependency is a dependency with a limited scope defined by conditions over one or more attributes. Only the matching part of the instance must adhere to the dependency. In this paper we focus on conditional inclusion dependencies (Cinds).

We generalize the definition of Cinds, distinguishing covering and completeness conditions. We present a new use case for such Cinds showing their value for solving complex data quality tasks. Further, we propose efficient algorithms that identify covering and completeness conditions conforming to given quality thresholds. Our algorithms choose not only the condition values but also the condition attributes automatically. Finally, we show that our approach efficiently provides meaningful and helpful results for our use case.
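The idea that only the matching part of the instance must adhere to the dependency can be sketched as follows. The example covers only the basic notion of a condition restricting an inclusion dependency. It does not model the covering and completeness conditions or the quality thresholds, whose definitions are not given in this abstract. The row representation, the equality-only conditions, and the name `cind_holds` are assumptions.

```python
def cind_holds(dep_rows, dep_attr, ref_values, condition):
    """Check an inclusion dependency restricted by a condition.

    dep_rows   -- list of dicts, one per tuple of the dependent relation
    dep_attr   -- name of the dependent attribute
    ref_values -- set of values of the referenced attribute
    condition  -- dict {attribute: required value}; an empty dict means
                  an ordinary (unconditional) inclusion dependency
    """
    # Keep only the tuples that match every condition attribute.
    selected = [
        row for row in dep_rows
        if all(row.get(attr) == value for attr, value in condition.items())
    ]
    # Only the selected tuples must have their value in the referenced set.
    return all(row[dep_attr] in ref_values for row in selected)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the paper.
    links = [
        {"ref": "A1", "kind": "protein"},
        {"ref": "A2", "kind": "protein"},
        {"ref": "X9", "kind": "gene"},
    ]
    protein_ids = {"A1", "A2"}
    print(cind_holds(links, "ref", protein_ids, {"kind": "protein"}))
    print(cind_holds(links, "ref", protein_ids, {}))
```

In the toy data the unconditional dependency fails because of the single `gene` row, while the dependency restricted to `kind = protein` holds.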