Efficient disk-based K-means clustering for relational databases

Ordóñez, Carlos; Omiecinski, Edward

doi:10.1109/tkde.2004.25

Cited by 79 publications

(45 citation statements)

References 29 publications


“…How to reduce the number of times the whole dataset is scanned so as to save the computation cost is one of the most important things in all the frequent pattern studies. The similar situation also exists in data clustering and classification studies because the design concept of earlier algorithms, such as mining the patterns on-the-fly [46], mining partial patterns at different stages [47], and reducing the number of times the whole dataset is scanned [32], are therefore presented to enhance the performance of these mining algorithms. Since some of the data mining problems are NP-hard [48] or the solution space is very large, several recent studies [23,49] have attempted to use metaheuristic algorithm as the mining algorithm to get the approximate solution within a reasonable time.…”

Section: Discussion (mentioning)

confidence: 90%

Big data analytics: a survey

Tsai, Lai, Chao, et al. 2015

Journal of Big Data

666

310


“…Since N is usually much larger than both K and d, the complexity becomes near linear to the number of samples in the data sets. K-means algorithm is effective in clustering large-scale data sets, and efforts have been made in order to overcome its disadvantages [142], [218].…”

Section: ) (mentioning)

confidence: 99%

Survey of Clustering Algorithms

Wunsch 2005

IEEE Trans. Neural Netw.

5,047

2,338


Abstract: Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed. Index Terms: Adaptive resonance theory (ART), clustering, clustering algorithm, cluster validation, neural networks, proximity, self-organizing feature map (SOFM).


“…To ensure efficient computation of the contrast measure, we use the one-pass k-means clustering strategy introduced in [23] with k = Q. We obtain Q clusters summarizing the data.…”

Section: Efficient Contrast Computation (mentioning)

confidence: 99%

CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection

Nguyen, Müller, Vreeken, et al. 2013

Proceedings of the 2013 SIAM International Conference on Data Mining


In many real world applications data is collected in multi-dimensional spaces, with the knowledge hidden in subspaces (i.e., subsets of the dimensions). It is an open research issue to select meaningful subspaces without any prior knowledge about such hidden patterns. Standard approaches, such as pairwise correlation measures, or statistical approaches based on entropy, do not solve this problem; due to their restrictive pairwise analysis and loss of information in discretization they are bound to miss subspaces with potential clusters and outliers. In this paper, we focus on finding subspaces with strong mutual dependency in the selected dimension set. Chosen subspaces should provide a high discrepancy between clusters and outliers and enhance detection of these patterns. To measure this, we propose a novel contrast score that quantifies mutual correlations in subspaces by considering their cumulative distributions without having to discretize the data. In our experiments, we show that these high contrast subspaces provide enhanced quality in cluster and outlier detection for both synthetic and real world data.
