A Unified View on Clustering Binary Data

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. To address this problem, a number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters hide in subspaces with very low dimensionality. These challenges motivate our effort to propose a robust partitional distance-based projected clustering algorithm. The algorithm consists of three phases. The first phase performs attribute relevance analysis by detecting dense and sparse regions and their location in each attribute. Starting from the results of the first phase, the goal of the second phase is to eliminate outliers, while the third phase aims to discover clusters in different subspaces. The clustering process is based on the K-means algorithm, with the computation of distance restricted to subsets of attributes where object values are dense. Our algorithm is capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space and avoids the computation of the distance in the full-dimensional space. The suitability of our proposal has been demonstrated through an empirical study using synthetic and real datasets.
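The abstract's central idea, restricting the distance computation to the attributes that are relevant to each cluster, can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the per-cluster lists of relevant attributes, and the normalisation by subspace size (used here so that distances in subspaces of different dimensionality are comparable) are all assumptions made for illustration.

```python
from math import sqrt


def projected_distance(point, center, relevant_dims):
    """Root-mean-square difference over the given attribute indices only.

    Dividing by the number of attributes is an assumption of this sketch,
    made so that subspaces of different sizes can be compared.
    """
    if not relevant_dims:
        raise ValueError("each cluster needs at least one relevant attribute")
    total = sum((point[d] - center[d]) ** 2 for d in relevant_dims)
    return sqrt(total / len(relevant_dims))


def assign(points, centers, dims_per_cluster):
    """Assign each point to the cluster that is nearest in that cluster's own subspace."""
    labels = []
    for p in points:
        dists = [
            projected_distance(p, c, dims)
            for c, dims in zip(centers, dims_per_cluster)
        ]
        labels.append(dists.index(min(dists)))
    return labels
```

In a K-means-style loop, an assignment step of this kind would alternate with recomputing each center; how the relevant attributes are found (the abstract's first phase) is outside the scope of this sketch.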


“…Given two binary data points z_1 and z_2, there are four fundamental quantities that can be used to define similarity between the two [35]:…”
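The quoted passage is cut off before it names the four quantities. A common convention, assumed here rather than taken from the source, is the four co-occurrence counts of a 2x2 contingency table: positions where both points are 1, where only the first is 1, where only the second is 1, and where both are 0. The sketch below computes those counts and two well-known similarity coefficients built from them; the function names are illustrative.

```python
def binary_counts(z1, z2):
    """Return (a, b, c, d): counts of 1-1, 1-0, 0-1 and 0-0 positions."""
    if len(z1) != len(z2):
        raise ValueError("binary vectors must have the same length")
    a = sum(1 for x, y in zip(z1, z2) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(z1, z2) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(z1, z2) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(z1, z2) if x == 0 and y == 0)
    return a, b, c, d


def jaccard(z1, z2):
    """Jaccard coefficient a / (a + b + c); ignores joint absences."""
    a, b, c, _ = binary_counts(z1, z2)
    denom = a + b + c
    # Two all-zero vectors are treated as identical (a convention of this sketch).
    return a / denom if denom else 1.0


def simple_matching(z1, z2):
    """Simple matching coefficient (a + d) / (a + b + c + d)."""
    a, b, c, d = binary_counts(z1, z2)
    n = a + b + c + d
    if n == 0:
        raise ValueError("binary vectors must be non-empty")
    return (a + d) / n
```

The two coefficients differ only in whether joint absences (d) count as agreement, which matters for sparse binary data where most attributes are 0 in both points.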

Section: Outlier Handling (mentioning; confidence: 99%)

Mining Projected Clusters in High-Dimensional Spaces

Bouguessa, Wang. 2009. IEEE Trans. Knowl. Data Eng.



“…A unified view of binary data clustering has been provided by examining the connections among various methods including entropy-based methods, distance-based methods (e.g., K-means), mixture models, and matrix decomposition [38,39]. In addition, it also shows the equivalence between K-means clustering methods with many other methods on binary data clustering using empirical studies [38,39]. In our experiments, we use K-means as the clustering methods.…”

Section: Methods (mentioning; confidence: 99%)

On combining multiple clusterings: an overview and a new perspective

Ogihara. 2009. Appl Intell.

Self Cite


Many problems can be reduced to the problem of combining multiple clusterings. In this paper, we first summarize different application scenarios of combining multiple clusterings and provide a new perspective of viewing the problem as a categorical clustering problem. We then show the connections between various consensus and clustering criteria and discuss the complexity results of the problem. Finally, we propose a new method to determine the final clustering. Experiments on kinship terms and clustering popular music from heterogeneous feature sets show the effectiveness of combining multiple clusterings.


“…For the solution of clustering problem the traditional algorithms, such as k-means algorithm [19,20], hierarchical clustering, differential evaluation algorithm, particle swarm optimization algorithm, artificial bee colony optimization, ant colony algorithm, and neural network algorithm GEM (Gaussian expectation-maximization), are usually used [21][22][23][24][25][26]. The up-to-date survey of evolutionary algorithms for clustering, especially the partition algorithms, are described in detail in [27].…”

Section: Related Work (mentioning; confidence: 99%)

Classification of Textual E‐Mail Spam Using Data Mining Techniques

Alguliyev, Alıguliyev, Nazirova. 2011. Applied Computational Intelligence and Soft Computing.


A new method for clustering of spam messages collected in bases of antispam system is offered. The genetic algorithm is developed for solving clustering problems. The objective function is a maximization of similarity between messages in clusters, which is defined by k-nearest neighbor algorithm. Application of genetic algorithm for solving constrained problems faces the problem of constant support of chromosomes which reduces convergence process. Therefore, for acceleration of convergence of genetic algorithm, a penalty function that prevents occurrence of infeasible chromosomes at ranging of values of function of fitness is used. After classification, knowledge extraction is applied in order to get information about classes. Multidocument summarization method is used to get the information portrait of each cluster of spam messages. Classifying and parametrizing spam templates, it will be also possible to define the thematic dependence from geographical dependence (e.g., what subjects prevail in spam messages sent from certain countries). Thus, the offered system will be capable to reveal purposeful information attacks if those occur. Analyzing origins of the spam messages from collection, it is possible to define and solve the organized social networks of spammers.


A Unified View on Clustering Binary Data

Cited by 37 publications

References 33 publications
