Efficiently Computing Inclusion Dependencies for Schema Discovery

Bauckmann, Jana; Leser, Ulf; Naumann, Felix

doi:10.1109/icdew.2006.54

Cited by 20 publications

(32 citation statements)

References 13 publications


“…Therefore, many studies have addressed the problem of helping users find integrity constraints from an existing data instance. However, most existing techniques address the problem of supporting the discovery of data integrity constraints in the context of relational databases [1] [2]. To the best of our knowledge, only a few papers address the problem of supporting the discovery of data integrity constraints in the Web context.…”

Section: Introduction (mentioning)

confidence: 99%

“…To the best of our knowledge, the complexity of the fastest algorithm to check if a pair (e_i, e_j) is an inclusion is in O(n) for the size of the sets [2] under the assumption that we sort the words in e_i and e_j before the calculation.…”

Section: Strict Comparison (mentioning)

confidence: 99%
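
The linear-time check mentioned in the statement above can be illustrated with a two-pointer scan over sorted inputs. This is a minimal sketch under stated assumptions, not the algorithm from the cited paper: the function name `is_included` and the list-based representation are hypothetical, and both inputs are assumed to be sorted already.

```python
def is_included(dependent, referenced):
    """Return True if every value of `dependent` occurs in `referenced`.

    Assumption: both arguments are lists sorted in ascending order.
    Each list is scanned at most once, so the check is linear in the
    combined length of the two lists.
    """
    j = 0  # current position in `referenced`
    for value in dependent:
        # Skip referenced values that are smaller than the one we need.
        while j < len(referenced) and referenced[j] < value:
            j += 1
        # Either `referenced` is exhausted or its next value is too large.
        if j == len(referenced) or referenced[j] != value:
            return False
    return True
```

For unsorted input, a call such as `is_included(sorted(set(a)), sorted(set(b)))` would work, at the cost of the extra sorting step that the statement assumes has been done beforehand.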

“…Bauckmann and others [2] proposed an algorithm that takes as input a set of relations and efficiently enumerates all pairs of relational attributes one of which includes the other. The algorithm is designed to minimize the amount of I/O over the sets of attribute values.…”

Section: Related Work (mentioning)

confidence: 99%

“…Because it is inevitable that finding all inclusion dependencies in a given data set yields false positives, a common approach is first to enumerate all possible candidates (including false positives) [2] and then to verify the enumerated candidates. This paper discusses algorithms for this two-phase approach in the Web context.…”

Section: Introduction (mentioning)

confidence: 99%

“…We want the scheme to be efficient, because the number of Web pages can be large. (2) Dealing with the characteristics of Web content. We want the scheme to be able to deal with the characteristics of Web content, because Web pages have hierarchical structures and their data are not necessarily clean.…”

Section: Introduction (mentioning)

confidence: 99%


Efficient filtering and ranking schemes for finding inclusion dependencies on the web

Morishima, Yumiya, Takahashi, et al. 2013

Proceedings of the 22nd ACM International Conference on Information & Knowledge Management


Data integrity constraints are fundamental in various applications, such as data management, integration, cleaning, and schema extraction. This paper presents the results of a first comprehensive study on finding inclusion dependencies on the Web. The problem is important because (1) applications of inclusion dependencies, such as data quality management, are beneficial in the Web context, and (2) such dependencies are not explicitly given in general. In our approach, we enumerate pairs of HTML/XML elements that possibly represent inclusion dependencies and then rank the results for verification. First, we propose a bit-based signature scheme to efficiently select candidates (element pairs) in the enumeration process. The signature scheme is unique in that it supports Jaccard containment to deal with the incomplete nature of data on the Web, and preserves the semiorder inclusion relationship among sets of words. Second, we propose a ranking scheme to support a user in checking whether each enumerated pair actually suggests inclusion dependencies. The ranking scheme sorts the enumerated pairs so that we can examine a small number of pairs for simultaneously verifying many pairs. Finally, we prove that there exist efficient algorithms for the ranking scheme. In addition to the theoretical results for the signature and ranking schemes, we present a comprehensive set of experimental results using various real Web sites. The results show that in the enumeration process the signature scheme reduces the number of candidate pairs by orders of magnitude, and that the ranking scheme allows a small number of higher ranked results to cover many other pairs.
