NRC Publications Archive / Archives des publications du CNRC

This publication could be one of several versions: author's original, accepted manuscript or the publisher's version. For the publisher's version, please access the DOI link below.

http://dx.doi.org/10.1007/s10844-012-0229-0

Reducing the size of databases for multirelational classification: a subgraph-based approach

Guo, Hongyu; Viktor, Herna L.; Paquet, Eric

Journal of Intelligent Information Systems, November 2012

Abstract

Multirelational classification aims to discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a database often spans numerous departments and/or subdivisions, which are involved in different aspects of the enterprise such as customer profiling, fraud detection, inventory management, financial management, and so on. When considering classification, different phases of the knowledge discovery process are affected by economic utility. For instance, during data preprocessing, one must consider the cost associated with acquiring, cleaning, and transforming large volumes of data. When training and testing the data mining models, one has to consider the impact of the data size on the running time of the learning algorithm. To address these utility-based issues, this paper presents an approach to create a pruned database for multirelational classification, while minimizing predictive performance loss on the final model. Our method identifies a set of strongly uncorrelated subgraphs from the original database schema to use for training, and discards all others.

Our experiments show that this strategy significantly reduces the size of the databases, in terms of the number of relations, tuples, and attributes, without sacrificing predictive accuracy. The approach prunes the size of the databases by as much as 94%. This reduction also lowers the computational cost of the learning process: the method improves the multirelational learning algorithms' execution time by as much as 80%. In particular, our results demonstrate that one may build an accurate model with only a small subset of the provided database.
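
To make the idea of keeping only uncorrelated subgraphs more concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes that each candidate subgraph of the schema has already been summarized by one numeric score per training tuple (for example, a class probability from a model trained on that subgraph alone). The subgraph names, the scores, the greedy ranking by correlation with the class label, and the `max_corr` threshold are all hypothetical choices made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation signal
    return cov / sqrt(var_x * var_y)


def select_subgraphs(scores, labels, max_corr=0.5):
    """Greedily keep subgraphs that are weakly correlated with each other.

    scores: dict mapping a (hypothetical) subgraph name to per-tuple scores.
    labels: class labels of the same tuples, encoded as numbers.
    max_corr: assumed cut-off on absolute pairwise correlation.
    """
    # Visit subgraphs from most to least correlated with the class label.
    ranked = sorted(
        scores,
        key=lambda name: abs(pearson(scores[name], labels)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        # Keep a subgraph only if it is not redundant with any kept one.
        if all(abs(pearson(scores[name], scores[k])) <= max_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical example: three subgraphs rooted at the target table.
labels = [1, 1, 0, 0, 1, 0]
scores = {
    "target-orders": [0.9, 0.8, 0.2, 0.1, 0.7, 0.3],
    "target-orders-items": [0.85, 0.75, 0.25, 0.2, 0.65, 0.35],
    "target-profile": [0.6, 0.2, 0.7, 0.1, 0.5, 0.3],
}
selected = select_subgraphs(scores, labels)
print(selected)  # ['target-orders', 'target-profile']
```

In this toy example the second subgraph is almost perfectly correlated with the first and is therefore dropped, so the tables that appear only in that subgraph would not need to be kept in the pruned database.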