Interestingness measures play an important role in data mining, regardless of the kind of patterns being mined. These measures are intended for selecting and ranking patterns according to their potential interest to the user. Good measures also allow the time and space costs of the mining process to be reduced. This survey reviews the interestingness measures for rules and summaries, classifies them from several perspectives, compares their properties, identifies their roles in the data mining process, gives strategies for selecting appropriate measures for applications, and identifies opportunities for future research in this area.
Most approaches to mining association rules implicitly consider the utilities of the itemsets to be equal. We assume that the utilities of itemsets may differ, and identify the high utility itemsets based on information in the transaction database and external information about utilities. Our theoretical analysis of the resulting problem lays the foundation for future utility mining algorithms.
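To make the idea concrete, here is a minimal brute-force sketch of identifying high utility itemsets. The transaction database, the per-unit utility table, and the threshold are all made-up illustrations; the abstract above does not specify a particular algorithm, and a real utility miner would prune the search space rather than enumerate every itemset.

```python
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> quantity, and an
# external table gives each item's per-unit utility (e.g., profit).
transactions = [
    {"A": 2, "B": 1},
    {"A": 1, "C": 3},
    {"B": 2, "C": 1},
]
unit_utility = {"A": 5, "B": 2, "C": 1}

def itemset_utility(itemset, txn):
    """Utility of an itemset in one transaction: sum of
    quantity * unit utility, or 0 if any item is absent."""
    if not all(i in txn for i in itemset):
        return 0
    return sum(txn[i] * unit_utility[i] for i in itemset)

def total_utility(itemset):
    """Utility of an itemset over the whole transaction database."""
    return sum(itemset_utility(itemset, t) for t in transactions)

# Brute-force enumeration of all itemsets meeting a utility threshold.
items = sorted(unit_utility)
min_util = 10  # illustrative threshold
high_utility = [
    frozenset(s)
    for r in range(1, len(items) + 1)
    for s in combinations(items, r)
    if total_utility(s) >= min_util
]
```

Note that, unlike support, utility is not anti-monotone: a superset of a low-utility itemset can still be high utility, which is what makes efficient pruning the central challenge the abstract refers to.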
Library of Congress Cataloging-in-Publication Data: Hilderman, Robert J. Knowledge discovery and measures of interest / by Robert J. Hilderman, Howard J. Hamilton. p. cm. (The Kluwer international series in engineering and computer science; SECS 638). Includes bibliographical references and index. ISBN 978-1-4419-4913-4; ISBN 978-1-4757-3283-2 (eBook).

Data mining algorithms can be broadly classified into two general areas: summarization and anomaly detection [71]. Summarization algorithms find concise descriptions of input data. For example, classification algorithms partition input data into disjoint groups. The results of such classification might be represented as a high-level summary, a decision tree, or a set of characteristic rules, as with C4.5 [112], DBLearn [58], and KID3 [110]. Anomaly-detection algorithms identify unusual features of data, such as combinations that occur with greater or lesser frequency than might be expected. For example, association algorithms find, from transaction records, sets of items that frequently appear together.

A summary generated from the cross-product domain for the compound attribute Shape-Size-Colour corresponds to a unique combination of nodes from the DGGs associated with the individual attributes, where one node is selected from the DGG associated with each attribute. For example, given the sales transaction database shown in Table 1.1 (assume the Shape, Size, and Colour attributes have been selected for generalization) and the associated DGGs shown in Figure 1.3, one of the many possible summaries that can be generated is shown in Table 1.5.
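The generalization step described above can be sketched in a few lines: each record's attribute values are replaced by ancestor nodes from their generalization hierarchies (paths in a DGG), and identical generalized tuples are then merged with a count. The hierarchies and records below are invented for illustration and do not reproduce Table 1.1 or Figure 1.3.

```python
from collections import Counter

# Made-up generalization hierarchies: Shape generalizes to the ANY node,
# Size generalizes to a Package-level node, Colour stays ungeneralized.
def shape_to_any(value):
    return "ANY"

size_to_package = {"small": "Retail", "medium": "Retail", "bulk": "Wholesale"}

# Invented sales records: (Shape, Size, Colour).
records = [
    ("round", "small", "red"),
    ("square", "small", "red"),
    ("round", "bulk", "blue"),
]

# Generalize each record, then merge duplicates into (tuple, count) pairs,
# producing a summary analogous to Table 1.5.
summary = Counter(
    (shape_to_any(shape), size_to_package[size], colour)
    for shape, size, colour in records
)
# The first two records collapse into the single generalized tuple
# ("ANY", "Retail", "red") with count 2.
```

Choosing a different node from each attribute's DGG yields a different summary, which is why the number of possible summaries grows with DGG complexity, as the next passage notes.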
The summary in Table 1.5 is obtained by generalizing the Shape attribute to the ANY node and the Size attribute to the Package node, while the Colour attribute remains ungeneralized. The complexity of the DGGs is a primary factor determining the number of summaries that can be generated, and depends only upon the number of […]

The rule-interest measure is RI = |X ∪ Y| − |X||Y|/N, where |X ∪ Y| is the number of tuples satisfying X → Y, and |X||Y|/N is the number of tuples expected if X and Y were independent (i.e., not associated). When RI = 0, X and Y are statistically independent and the rule is not interesting. When RI > 0 (RI < 0), X is positively (negatively) correlated with Y. The significance of the correlation between X and Y can be determined using the chi-square test for a 2 × 2 contingency table. Rules that exceed a predetermined minimum significance threshold are considered the most interesting.
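The rule-interest measure and its chi-square significance test can be sketched directly from the counts in the passage above. The helper names and the example numbers are illustrative; the counts are n_xy tuples satisfying both X and Y, n_x satisfying X, n_y satisfying Y, and n tuples in total.

```python
def rule_interest(n_xy, n_x, n_y, n):
    """RI = |X u Y| - |X||Y|/N: 0 means X and Y are independent,
    positive (negative) values mean positive (negative) correlation."""
    return n_xy - n_x * n_y / n

def chi_square(n_xy, n_x, n_y, n):
    """Chi-square statistic for the 2x2 contingency table of X vs Y."""
    observed = {
        (1, 1): n_xy,
        (1, 0): n_x - n_xy,
        (0, 1): n_y - n_xy,
        (0, 0): n - n_x - n_y + n_xy,
    }
    row = {1: n_x, 0: n - n_x}   # marginal counts for X
    col = {1: n_y, 0: n - n_y}   # marginal counts for Y
    chi2 = 0.0
    for (i, j), obs in observed.items():
        expected = row[i] * col[j] / n
        chi2 += (obs - expected) ** 2 / expected
    return chi2

# Illustrative counts: 100 tuples, 40 satisfy X, 50 satisfy Y, 30 both.
ri = rule_interest(30, 40, 50, 100)     # 30 - 40*50/100 = 10
chi2 = chi_square(30, 40, 50, 100)      # compare against the 1-df
                                        # critical value 3.84 at alpha=0.05
```

With these numbers RI is positive and the chi-square statistic exceeds 3.84, so in this toy example the rule would pass the significance threshold.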
In this paper, we propose an efficient rule discovery algorithm, called FD_Mine, for mining functional dependencies from data. By exploiting Armstrong's Axioms for functional dependencies, we identify equivalences among attributes, which can be used to reduce both the size of the dataset and the number of functional dependencies to be checked. We first describe four effective pruning rules that reduce the size of the search space. In particular, the number of functional dependencies to be checked is reduced by skipping the search for FDs that are logically implied by already discovered FDs. Then, we present the FD_Mine algorithm, which incorporates the four pruning rules into the mining process. We prove the correctness of FD_Mine, that is, we show that the pruning does not lead to the loss of useful information. We report the results of a series of experiments. These experiments show that the proposed algorithm is effective on 15 UCI datasets and synthetic data.
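The core test such a miner performs for each candidate dependency is easy to state: X → Y holds exactly when any two rows that agree on X also agree on Y. Below is a minimal sketch of that check over a table of dictionaries; it illustrates only the validation step, not FD_Mine's equivalence detection or pruning rules, and the table is an invented example.

```python
def fd_holds(rows, lhs, rhs):
    """Return True iff the functional dependency lhs -> rhs holds:
    rows agreeing on all attributes in lhs also agree on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs value;
        # any later mismatch is a counterexample to the dependency.
        if seen.setdefault(key, val) != val:
            return False
    return True

# Invented relation: employees with departments and managers.
rows = [
    {"emp": 1, "dept": "A", "mgr": "Ann"},
    {"emp": 2, "dept": "A", "mgr": "Ann"},
    {"emp": 3, "dept": "B", "mgr": "Bob"},
]
# dept -> mgr holds; mgr -> emp does not (Ann manages two employees).
```

A naive miner would run this check for every candidate pair of attribute sets; the pruning rules in the abstract exist precisely to skip candidates whose status is already implied by Armstrong's Axioms and previously discovered dependencies.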