Efficient classification across multiple database relations: a CrossMine approach

Yin, Xiaoxin; Han, Jiawei; Yang, Jiong; Yu, Philip S.

doi:10.1109/tkde.2006.94

Cited by 79 publications

(47 citation statements)

References 14 publications


“…In addition, we also present how the databases are pruned. We perform our experiments using the MRC (Guo and Viktor, 2006), RelAggs (Krogel, 2005), TILDE (Blockeel and Raedt, 1998), and CrossMine (Yin et al, 2006) algorithms, with their default settings. The MRC and RelAggs approaches are aggregation-based algorithms where C4.5 decision trees (Quinlan, 1993) were applied as the single-table learner.…”

Section: Results (mentioning)

confidence: 99%
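
The aggregation-based idea mentioned in the statement above can be illustrated with a minimal Python sketch. It is not the MRC or RelAggs implementation; it only shows the general pattern of summarising a one-to-many child table into fixed-length features per target row, which a single-table learner such as a decision tree could then consume. The table contents, the column names (`loan_id`, `amount`), and the `aggregate` helper are all hypothetical.

```python
from collections import defaultdict


def aggregate(child_rows, key, value):
    """Summarise numeric `value` per `key` (hypothetical helper, not from MRC/RelAggs)."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[key]].append(row[value])
    return {
        k: {
            "count": len(v),
            "sum": sum(v),
            "min": min(v),
            "max": max(v),
            "avg": sum(v) / len(v),
        }
        for k, v in groups.items()
    }


# Hypothetical child table: several transactions per loan (one-to-many).
transactions = [
    {"loan_id": 1, "amount": 100},
    {"loan_id": 1, "amount": 300},
    {"loan_id": 2, "amount": 50},
]

# One feature vector per target-table key, ready to be joined onto the target table.
features = aggregate(transactions, key="loan_id", value="amount")
```

After this step each loan has exactly one row of aggregate features, so the multirelational problem has been flattened into a single table for that relationship.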

“…We use the length of the join path as the stopping criterion, preferring subgraphs with shorter length. The reason for preferring shorter subgraphs is that semantic links with too many joins are usually very weak in a relational database (Yin et al, 2006). Thus we specify a maximum length for join paths.…”

Section: Algorithm 2 Subgraph Construction (mentioning)

confidence: 99%
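
The maximum join-path length described in the statement above can be sketched as a bounded breadth-first enumeration over the schema graph. This is an illustrative sketch only, not the citing paper's Algorithm 2: the `join_paths` function, the adjacency-dict representation of the schema, and the table names are assumptions made for the example.

```python
from collections import deque


def join_paths(schema, target, max_length):
    """Enumerate simple join paths from `target` that use at most `max_length` joins.

    `schema` maps each table name to the set of tables it is linked to
    (hypothetical representation of foreign-key links).
    """
    paths = []
    queue = deque([[target]])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            paths.append(path)
        # Stop extending once the path already uses the maximum number of joins.
        if len(path) - 1 == max_length:
            continue
        for neighbour in sorted(schema.get(path[-1], ())):
            if neighbour not in path:  # keep paths simple (no revisited tables)
                queue.append(path + [neighbour])
    return paths


# Hypothetical schema, loosely styled after a financial database.
schema = {
    "loan": {"account"},
    "account": {"loan", "order", "disposition"},
    "order": {"account"},
    "disposition": {"account", "client"},
    "client": {"disposition"},
}

paths = join_paths(schema, "loan", max_length=2)
```

Because the search is breadth-first, shorter paths are produced before longer ones, which matches the stated preference for shorter subgraphs; raising `max_length` to 3 would additionally reach the `client` table.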

“…The second learning problem (F682AC) attempts to classify if the loan is good or bad from the 682 instances, regardless of whether the loan is finished or not. Our third experimental task (F400AC) uses the Financial database as prepared in (Yin et al, 2006), which has 400 examples in the target table.…”

Section: Financial Database (mentioning)

confidence: 99%

“…The database generator was obtained from Yin et al (2006). In their paper, Yin et al used this database generator to create synthetic databases to mimic real-world databases in order to evaluate the scalability of the multirelational classification algorithm CrossMine.…”

Section: Synthetic Databases (mentioning)

confidence: 99%

“…Multirelational classification, which aims to discover patterns across multiple interlinked tables (relations) in a relational database, poses a unique opportunity for the data mining community (Quinlan and Cameron-Jones, 1993; Zhong and Ohsuga, 1995; Dehaspe et al, 1998; Blockeel and Raedt, 1998; Dzeroski and Lavrac, 2001; Jensen et al, 2002; Jamil, 2002; Han and Kamber, 2005; Krogel, 2005; Burnside et al, 2005; Ceci and Appice, 2006; Yin et al, 2006; Frank et al, 2007; Getoor and Taskar, 2007; Bhattacharya and Getoor, 2007; Landwehr et al, 2007; Rückert and Kramer, 2008; De Raedt, 2008; Chen et al, 2009; Landwehr et al, 2010; Guo et al, 2011). Such relational databases are currently one of the most popular types of relational data repositories.…”

Section: Introduction (mentioning)

confidence: 99%


Reducing the size of databases for multirelational classification: a subgraph-based approach

2012


Guo, Hongyu; Viktor, Herna L.; Paquet, Eric

Journal of Intelligent Information Systems, November 2012

doi:10.1007/s10844-012-0229-0

Abstract: Multirelational classification aims to discover patterns across multiple interlinked tables (relations) in a relational database. In many large organizations, such a database often spans numerous departments and/or subdivisions, which are involved in different aspects of the enterprise such as customer profiling, fraud detection, inventory management, financial management, and so on. When considering classification, different phases of the knowledge discovery process are affected by economic utility. For instance, in the data preprocessing process, one must consider the cost associated with acquiring, cleaning, and transforming large volumes of data. When training and testing the data mining models, one has to consider the impact of the data size on the running time of the learning algorithm. In order to address these utility-based issues, the paper presents an approach to create a pruned database for multirelational classification, while minimizing predictive performance loss on the final model. Our method identifies a set of strongly uncorrelated subgraphs from the original database schema, to use for training, and discards all others. The experiments performed show that our strategy is able to, without sacrificing predictive accuracy, significantly reduce the size of the databases, in terms of the number of relations, tuples, and attributes. The approach prunes the sizes of databases by as much as 94%. Such reduction also results in decreasing computational cost of the learning process. The method improves the multirelational learning algorithms' execution time by as much as 80%. In particular, our results demonstrate that one may build an accurate model with only a small subset of the provided database.
