Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2006
DOI: 10.1145/1150402.1150414

Robust information-theoretic clustering

Abstract: How do we find a natural clustering of a real-world point set, which contains an unknown number of clusters with different shapes, and which may be contaminated by noise? Most clustering algorithms were designed under certain assumptions (e.g., Gaussianity); they often require the user to provide input parameters, and they are sensitive to noise. In this paper, we propose a robust framework for determining a natural clustering of a given data set, based on the minimum description length (MDL) principle. The proposed fra…
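
To make the MDL idea in the abstract concrete, below is a minimal sketch (not the paper's actual RIC algorithm): candidate mixture models with different numbers of clusters k are scored by an information-theoretic criterion, and the k with the lowest score is kept. BIC from scikit-learn's GaussianMixture is used here as a stand-in for a description-length score; the paper's coding scheme and its robustness to noise and arbitrary cluster shapes go well beyond this.

```python
# Sketch: choose k by minimizing an MDL-style criterion (BIC as a stand-in).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

scores = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    scores[k] = gmm.bic(X)          # lower score ~ shorter description length

best_k = min(scores, key=scores.get)
print(f"selected k = {best_k}")     # typically 3 for this synthetic data
```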

Cited by 37 publications (23 citation statements) · References 15 publications · Citing publications: 2008–2021

Citation statements
“…The user specifies the model complexity by parameter settings, most importantly by selecting the number of clusters k. Most approaches to parameter-free clustering, e.g. X-Means [16], G-Means [12] and RIC [5] employ information-theoretic criteria to achieve a balance between the complexity of the model and its quality for interpretation. However, these approaches rely on a relatively simple cluster notion.…”
Section: Related Work (mentioning)
confidence: 99%
“…Some information-theoretic algorithms have recently been proposed with the major focus on avoiding crucial parameter settings in clustering, e.g. [24,15,7,8]. As SONAR, these algorithms rely on the Minimum Description Length principle [13], which allows model selection by regarding clustering as a data compression problem.…”
Section: Related Work and Discussion (mentioning)
confidence: 99%
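
The "clustering as data compression" view mentioned in this statement can be illustrated with a toy two-part score: the description length of a candidate clustering is the cost of encoding the model parameters plus the cost of encoding the data given those parameters. The sketch below is an assumption-laden illustration (per-cluster Gaussians, a flat 32-bit cost per parameter), not the coding scheme of RIC, SONAR, or any of the cited algorithms.

```python
# Toy two-part MDL score: L(model) + L(data | model), measured in bits.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.cluster import KMeans

def description_length(X, labels, bits_per_parameter=32):
    n, d = X.shape
    data_bits = 0.0
    n_params = 0
    for c in np.unique(labels):
        pts = X[labels == c]
        mu = pts.mean(axis=0)
        cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(d)
        # L(data | model): negative log2-likelihood under this cluster's Gaussian
        data_bits += -multivariate_normal.logpdf(pts, mu, cov).sum() / np.log(2)
        n_params += d + d * (d + 1) // 2        # mean + covariance entries
    model_bits = n_params * bits_per_parameter  # crude L(model)
    return model_bits + data_bits

X = np.random.default_rng(0).normal(size=(300, 2))   # pure noise, no real clusters
for k in (1, 2, 3, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(description_length(X, labels)))
```

Splitting structureless data into more clusters buys little in data-coding cost but pays a fixed model cost per cluster, which is how an MDL score can discourage spurious clusters.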
“…The MDL principle was used for vector quantization (Bischof et al 1999), where superfluous vectors were detected via MDL. Böhm et al (2006) used MDL to optimise a given partitioning by choosing specific models for each of the parts. These model-classes need to be pre-defined, requiring premonition of the component models in the data.…”
Section: Related Work (mentioning)
confidence: 99%
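
The per-part model selection described in this statement can be sketched as follows: for a fixed partitioning, each cluster is assigned the distribution from a small pre-defined model class that minimizes its coding cost. The model class (Gaussian, uniform, Laplacian on a single coordinate) and the flat parameter penalty below are illustrative assumptions, not the exact scheme of Böhm et al. (2006).

```python
# Sketch: pick the cheapest-to-encode model per cluster from a fixed class.
import numpy as np
from scipy import stats

PARAM_BITS = 32   # crude cost per real-valued parameter

def coding_cost(data, dist_name):
    if dist_name == "gaussian":
        mu, sigma = data.mean(), data.std(ddof=1) + 1e-12
        nll = -stats.norm.logpdf(data, mu, sigma).sum()
    elif dist_name == "uniform":
        lo, hi = data.min(), data.max()
        nll = -stats.uniform.logpdf(data, loc=lo, scale=hi - lo + 1e-12).sum()
    elif dist_name == "laplacian":
        loc = np.median(data)
        scale = np.abs(data - loc).mean() + 1e-12
        nll = -stats.laplace.logpdf(data, loc, scale).sum()
    # negative log2-likelihood plus a flat cost for the two fitted parameters
    return nll / np.log(2) + 2 * PARAM_BITS

def best_model_per_cluster(X, labels):
    choice = {}
    for c in np.unique(labels):
        col = X[labels == c][:, 0]   # 1-D sketch: first coordinate only
        costs = {m: coding_cost(col, m) for m in ("gaussian", "uniform", "laplacian")}
        choice[c] = min(costs, key=costs.get)
    return choice

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, (200, 1)), rng.uniform(5, 9, (200, 1))])
labels = np.array([0] * 200 + [1] * 200)
print(best_model_per_cluster(X, labels))   # likely {0: 'gaussian', 1: 'uniform'}
```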