Hierarchical model-based clustering of large datasets through fractionation and refractionation

Tantrum, Jeremy; Murua, Alejandro; Stuetzle, Werner

doi:10.1145/775047.775074

Cited by 20 publications

(4 citation statements)

References 13 publications

“…Posse (2001) proposed a method based on the minimum spanning tree for obtaining the initial partition. A different approach called fractionation was proposed by Tantrum, Murua, and Stuetzle (2002) in the context of hierarchical model-based clustering, where the complete data set is split up randomly into smaller sets, which are clustered individually. The process is then repeated, but with the smaller sets formed by aggregated clusters rather than randomly.…”

Section: Discussion (mentioning)

confidence: 99%
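The fractionation loop described in this statement can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a greedy centroid-linkage merge stands in for model-based hierarchical clustering, clusters are summarized by unweighted means, and later passes simply regroup consecutive summaries instead of forming fractions from aggregated clusters. The names `agglomerate`, `fractionate`, `fraction_size`, and `groups_per_fraction` are hypothetical.

```python
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def agglomerate(points, k):
    """Greedily merge the two clusters with the closest centroids until k remain.

    Assumption: this is a simple stand-in for model-based hierarchical
    clustering, which would merge by a likelihood criterion instead.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(mean(clusters[i]), mean(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def fractionate(points, fraction_size, groups_per_fraction, k, seed=0):
    """Reduce the data fraction by fraction, then cluster the summaries into k groups."""
    if not 0 < groups_per_fraction < fraction_size:
        raise ValueError("groups_per_fraction must be positive and below fraction_size")
    items = list(points)
    random.Random(seed).shuffle(items)  # first pass: random fractions
    while len(items) > fraction_size:
        fractions = [items[i:i + fraction_size]
                     for i in range(0, len(items), fraction_size)]
        # Each fraction is clustered on its own and replaced by its cluster means.
        items = [mean(c) for f in fractions
                 for c in agglomerate(f, groups_per_fraction)]
    return [mean(c) for c in agglomerate(items, k)]
```

Because each pass replaces a full fraction with fewer summaries than it had members, the working set shrinks until it fits in a single fraction, at which point one final clustering yields the `k` centers.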

“…Various approaches to the problem of clustering large data sets have been proposed, including initialization by clustering a sample of the data (Banfield and Raftery 1993; Fayyad and Smyth 1996; Maitra 2001), and using an initial crude partitioning of the entire data set (Posse 2001; Tantrum, Murua, and Stuetzle 2002). The simplest and perhaps most widely applied approach is to apply the clustering method first to a small simple random sample from the data, and then apply the resulting estimated model to the full data set using discriminant analysis (Banfield and Raftery 1993).…”

Section: Introduction (mentioning)

confidence: 99%

Model-Based Clustering for Image Segmentation and Large Datasets Via Sampling

Wehrens, Buydens, Fraley et al. 2003

Abstract: The rapid increase in the size of data sets makes clustering all the more important to capture and summarize the information, at the same time making clustering more difficult to accomplish. If model-based clustering is applied directly to a large data set, it can be too slow for practical application. A simple and common approach is to first cluster a random sample of moderate size, and then use the clustering model found in this way to classify the remainder of the objects. We show that, in its simplest form, this method may lead to unstable results. Our experiments suggest that a stable method with better performance can be obtained with two straightforward modifications to the simple sampling method: several tentative models are identified from the sample instead of just one, and several EM steps are used rather than just one E step to classify the full data set. We find that there are significant gains from increasing the size of the sample up to about 2,000, but not from further increases. These conclusions are based on the application of several alternative strategies to the segmentation of three different multispectral images, and to several simulated data sets.
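The sample-then-extend strategy summarized in this abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the paper's procedure: it uses a one-dimensional Gaussian mixture, a crude evenly spaced initialization, and a single candidate model instead of several tentative ones. The function names and the defaults `sample_steps` and `full_steps` are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def e_step(data, weights, means, variances):
    """Posterior membership probabilities of every observation."""
    resp = []
    for x in data:
        dens = [w * normal_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        total = sum(dens) or 1e-300  # guard against underflow to zero
        resp.append([d / total for d in dens])
    return resp


def m_step(data, resp, min_var=1e-6):
    """Re-estimate mixing weights, means, and variances from the posteriors."""
    weights, means, variances = [], [], []
    for j in range(len(resp[0])):
        nj = sum(r[j] for r in resp) or 1e-300
        mu = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, data)) / nj
        weights.append(nj / len(data))
        means.append(mu)
        variances.append(max(var, min_var))
    return weights, means, variances


def run_em(data, params, steps):
    """Apply a fixed number of EM iterations starting from params."""
    for _ in range(steps):
        params = m_step(data, e_step(data, *params))
    return params


def cluster_via_sample(data, k, sample_size, sample_steps=50, full_steps=3, seed=0):
    """Fit on a random sample, refine with a few EM steps on all data, then classify."""
    sample = random.Random(seed).sample(data, min(sample_size, len(data)))
    lo, hi = min(sample), max(sample)
    means = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]  # crude start
    var = max(((hi - lo) / (2 * k)) ** 2, 1e-6)
    params = run_em(sample, ([1.0 / k] * k, means, [var] * k), sample_steps)
    params = run_em(data, params, full_steps)  # several EM steps, not one E step
    resp = e_step(data, *params)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return params, labels
```

Setting `full_steps=0` reduces the sketch to the simple variant, where the model fitted on the sample classifies the full data with a single E step.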

“…Also, there are no methods for choosing the number of clusters in model-based hierarchical clustering that are comparable to those available for mixture models. However, this approach could be combined with the mixture modeling approach as a starting scheme for a strategy similar to Strategy W. Tantrum et al (2002) extended model-based hierarchical clustering to large data sets through "refractionation", which splits the data up into many subsets or "fractions". The fractions are clustered by model-based hierarchical clustering into a fixed number of groups and then summarized by their means into meta-observations.…”

Section: Discussion (mentioning)

confidence: 99%

Incremental Model-Based Clustering for Large Datasets With Small Clusters

Fraley, Raftery, Wehrens 2003

Abstract: Clustering is often useful for analyzing and summarizing information within large datasets. Model-based clustering methods have been found to be effective for determining the number of clusters, dealing with outliers, and selecting the best clustering method in datasets that are small to moderate in size. For large datasets, current model-based clustering methods tend to be limited by memory and time requirements and the increasing difficulty of maximum likelihood estimation. They may fit too many clusters in some portions of the data and/or miss clusters containing relatively few observations. We propose an incremental approach for data that can be processed as a whole in memory, which is relatively efficient computationally and has the ability to find small clusters in large datasets. The method starts by drawing a random sample of the data, selecting and fitting a clustering model to the sample, and extending the model to the full dataset by additional EM iterations. New clusters are then added incrementally, initialized with the observations that are poorly fit by the current model. We demonstrate the effectiveness of this method by applying it to simulated data, and to image data where its performance can be assessed visually.

“…There have been several recent advances in extending the normal mixture model to large datasets [2,12].…”

Section: Introduction (mentioning)

confidence: 99%

Assessment and pruning of hierarchical model based clustering

Tantrum, Murua, Stuetzle 2003

Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '03

Self Cite

Abstract: The goal of clustering is to identify distinct groups in a dataset. The basic idea of model-based clustering is to approximate the data density by a mixture model, typically a mixture of Gaussians, and to estimate the parameters of the component densities, the mixing fractions, and the number of components from the data. The number of distinct groups in the data is then taken to be the number of mixture components, and the observations are partitioned into clusters (estimates of the groups) using Bayes' rule. If the groups are well separated and look Gaussian, then the resulting clusters will indeed tend to be "distinct" in the most common sense of the word: contiguous, densely populated areas of feature space, separated by contiguous, relatively empty regions. If the groups are not Gaussian, however, this correspondence may break down; an isolated group with a non-elliptical distribution, for example, may be modeled by not one, but several mixture components, and the corresponding clusters will no longer be well separated. We present methods for assessing the degree of separation between the components of a mixture model and between the corresponding clusters. We also propose an algorithm for pruning the cluster tree generated by hierarchical model-based clustering. The algorithm starts with the tree corresponding to the mixture model chosen by the Bayesian Information Criterion. It then progressively merges clusters that do not appear to correspond to different modes of the data density.
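The Bayes' rule partition mentioned in this abstract can be written out for a Gaussian mixture as below. The symbols ($G$, $\pi_k$, $\mu_k$, $\Sigma_k$, $\tau_k$, $\hat{c}$) are notation chosen here for illustration and are not taken from the paper.

```latex
% Mixture density with G Gaussian components and mixing fractions \pi_k
f(x) = \sum_{k=1}^{G} \pi_k \, \phi(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1

% Posterior probability that observation x was generated by component k
\tau_k(x) = \frac{\pi_k \, \phi(x \mid \mu_k, \Sigma_k)}
                 {\sum_{j=1}^{G} \pi_j \, \phi(x \mid \mu_j, \Sigma_j)}

% Bayes' rule assignment: x goes to the component with the largest posterior
\hat{c}(x) = \arg\max_{k \in \{1, \dots, G\}} \tau_k(x)
```

When one non-elliptical group is covered by several components, the posteriors $\tau_k(x)$ of those components are comparable over a wide region, which is one way to see why the resulting clusters are not well separated.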
