Library of Congress Cataloging-in-Publication Data

Hilderman, Robert J.
Knowledge discovery and measures of interest / by Robert J. Hilderman, Howard J. Hamilton.
p. cm. -- (The Kluwer international series in engineering and computer science; SECS 638)
Includes bibliographical references and index.
ISBN 978-1-4419-4913-4
ISBN 978-1-4757-3283-2 (eBook)

Data mining algorithms can be broadly classified into two general areas: summarization and anomaly detection [71]. Summarization algorithms find concise descriptions of input data. For example, classificatory algorithms partition input data into disjoint groups. The results of such classification might be represented as a high-level summary, a decision tree, or a set of characteristic rules, as with C4.5 [112], DBLearn [58], and KID3 [110]. Anomaly-detection algorithms identify unusual features of data, such as combinations that occur with greater or lesser frequency than might be expected. For example, association algorithms find, from transaction records, sets of items that appear with [...]

A summary generated from the cross-product domain for the compound attribute Shape-Size-Colour corresponds to a unique combination of nodes from the DGGs associated with the individual attributes, where one node is selected from the DGG associated with each attribute. For example, given the sales transaction database shown in Table 1.1 (assume the Shape, Size, and Colour attributes have been selected for generalization) and the associated DGGs shown in Figure 1.3, one of the many possible summaries that can be generated is shown in Table 1.5.
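The idea of selecting one node from each attribute's DGG can be illustrated with a minimal Python sketch. The DGGs of Figure 1.3 are not reproduced in this excerpt, so the node lists below are hypothetical placeholders: only the ANY node for Shape and the Package node for Size are named in the text. The sketch also assumes that the ungeneralized node of each attribute counts as a selectable node.

```python
from itertools import product

# Hypothetical DGG node lists, ordered from least to most general.
# Only "ANY" (Shape) and "Package" (Size) are named in the text; the
# rest are assumptions made for illustration.
dgg_nodes = {
    "Shape": ["Shape", "ANY"],
    "Size": ["Size", "Package", "ANY"],
    "Colour": ["Colour", "ANY"],
}

def enumerate_summaries(nodes_by_attribute):
    """Return every combination that picks one node from each DGG."""
    attributes = list(nodes_by_attribute)
    node_lists = [nodes_by_attribute[a] for a in attributes]
    return [dict(zip(attributes, combo)) for combo in product(*node_lists)]

summaries = enumerate_summaries(dgg_nodes)

# With 2, 3, and 2 nodes respectively, there are 2 * 3 * 2 combinations.
print(len(summaries))
```

Under these assumptions, the combination `{"Shape": "ANY", "Size": "Package", "Colour": "Colour"}` is one of the enumerated entries: Shape generalized to ANY, Size generalized to Package, and Colour left ungeneralized.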
The summary in Table 1.5 is obtained by generalizing the Shape attribute to the ANY node and the Size attribute to the Package node, while the Colour attribute remains ungeneralized.

The complexity of the DGGs is a primary factor determining the number of summaries that can be generated, and depends only upon the number of [...]

[...] satisfying X → Y, and |X||Y| / N is the number of tuples expected if X and Y were independent (i.e., not associated).

When RI = 0, then X and Y are statistically independent and the rule is not interesting. When RI > 0 (RI < 0), then X is positively (negatively) correlated to Y. The significance of the correlation between X and Y can be determined using the chi-square test for a 2 × 2 contingency table. Those rules which do not exceed a predetermined minimum significance threshold are determined to be the most interesting.
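The RI computation and the accompanying chi-square test can be sketched in Python as follows. The full definition of RI is cut off in this excerpt; the sketch assumes the usual form of the rule-interest function, RI = |X ∩ Y| − |X||Y| / N, which is consistent with the surviving description of |X||Y| / N as the expected count under independence. The function names, parameter names, and example counts are hypothetical.

```python
def rule_interest(n_total, n_x, n_y, n_xy):
    """Assumed form: RI = |X and Y| - |X||Y| / N (observed minus expected)."""
    return n_xy - (n_x * n_y) / n_total

def chi_square_2x2(n_total, n_x, n_y, n_xy):
    """Pearson chi-square statistic for the 2 x 2 table of X versus Y."""
    a = n_xy                          # X and Y
    b = n_x - n_xy                    # X and not Y
    c = n_y - n_xy                    # not X and Y
    d = n_total - n_x - n_y + n_xy    # neither X nor Y
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0                    # a row or column is empty; no evidence
    return n_total * (a * d - b * c) ** 2 / denominator

# Hypothetical counts: 100 tuples, 40 satisfy X, 50 satisfy Y, 30 satisfy both.
ri = rule_interest(100, 40, 50, 30)
chi2 = chi_square_2x2(100, 40, 50, 30)
print(ri, chi2)
```

For these counts the expected number of tuples under independence is 20, so RI is positive. The statistic can then be compared with a critical value for one degree of freedom (3.841 at the 0.05 level).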
One of the most important steps in any knowledge discovery task is the interpretation and evaluation of discovered patterns. To address this problem, various techniques, such as the chi-square test for independence, have been suggested to reduce the number of patterns presented to the user and to focus attention on those that are truly statistically significant. However, when mining a large database, the number of patterns discovered can remain large even after adjusting significance thresholds to eliminate spurious patterns. What is needed, then, is an effective measure to further assist in the interpretation and evaluation step that ranks the interestingness of the remaining patterns prior to presenting them to the user. In this paper, we describe a two-step process for ranking the interestingness of discovered patterns that utilizes the chi-square test for independence in the first step and objective measures of interestingness in the second step. We show how this two-step process can be applied to ranking characterized/generalized association rules and data cubes.
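The two-step process described above can be sketched roughly in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: each pattern is assumed to carry a precomputed chi-square statistic, and the second-step measure is passed in as a generic `score` function because the abstract does not name the specific objective measures. All pattern names and values are made up.

```python
def rank_patterns(patterns, chi2_critical, score):
    """Step 1: keep patterns whose chi-square statistic exceeds the critical
    value. Step 2: rank the survivors by the supplied interestingness score."""
    significant = [p for p in patterns if p["chi2"] > chi2_critical]
    return sorted(significant, key=score, reverse=True)

# Hypothetical patterns with made-up statistics and scores.
patterns = [
    {"name": "A", "chi2": 16.7, "interest": 0.40},
    {"name": "B", "chi2": 1.2, "interest": 0.90},
    {"name": "C", "chi2": 5.0, "interest": 0.75},
]

# 3.841 is the chi-square critical value for 1 degree of freedom at 0.05.
ranked = rank_patterns(patterns, 3.841, score=lambda p: p["interest"])
print([p["name"] for p in ranked])
```

Pattern B has the highest score but fails the significance filter, so it never reaches the ranking step; this is the sense in which the first step prunes before the second step orders.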