Model-Based Clustering for Image Segmentation and Large Datasets Via Sampling

Wehrens, Ron; Buydens, L.M.C.; Fraley, Chris; Raftery, Adrian E.

doi:10.21236/ada459638

Cited by 11 publications

(18 citation statements)

References 7 publications


“…In a basic sampling approach, a random sample of the data is used to calculate the clusters and then an additional "E" (expectation) step is used to classify the remaining items. This approach can be improved by building multiple models from the initial sample and then running through several steps of the EM algorithm to fit the whole dataset to these models (Wehrens et al, 2004) or by looking to create new clusters for observations in the full dataset that are fit badly by the sample clusters (Fraley et al, 2005). In addition, as per other clustering approaches, parallel methods have been developed (Kriegel et al, 2005; McNicholas et al, 2010).…”

Section: Modern Large Scale Segmentation Approaches (mentioning)

confidence: 99%

Marketing analytics: Methods, practice, implementation, and links to other fields

Ghose

2019

Expert Systems with Applications


Marketing analytics is a diverse field, with both academic researchers and practitioners coming from a range of backgrounds including marketing, expert systems, statistics, and operations research. This paper provides an integrative review at the boundary of these areas. The aim is to give researchers in the intelligent and expert systems community the opportunity to gain a broad view of the marketing analytics area and provide a starting point for future interdisciplinary collaboration. The topics of visualization, segmentation, and class prediction are featured. Links between the disciplines are emphasized. For each of these topics, a historical overview is given, starting with initial work in the 1960s and carrying through to the present day. Recent innovations for modern, large, and complex "big data" sets are described. Practical implementation advice is given, along with a directory of open source R routines for implementing marketing analytics techniques.
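The basic sampling approach described in the citation statement above (cluster a random sample, then run one extra E step to classify the remaining items) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any of the cited papers: it uses a one-dimensional Gaussian mixture, a crude quantile-based initialisation, and hypothetical function names (`fit_on_sample`, `cluster_via_sampling`) chosen only for this example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def e_step(data, weights, means, variances):
    """E step: posterior membership probabilities of each point in each component."""
    resp = []
    for x in data:
        dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(dens) or 1e-300  # guard against numerical underflow
        resp.append([d / total for d in dens])
    return resp


def m_step(data, resp, k):
    """M step: update mixing weights, means and variances from the memberships."""
    n = len(data)
    weights, means, variances = [], [], []
    for j in range(k):
        nj = max(sum(r[j] for r in resp), 1e-12)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        weights.append(nj / n)
        means.append(mean)
        variances.append(max(var, 1e-6))  # variance floor (an assumption)
    return weights, means, variances


def fit_on_sample(sample, k=2, iters=50):
    """Run EM on the sample only; initial means are sample quantiles (an assumption)."""
    s = sorted(sample)
    centre = sum(s) / len(s)
    spread = sum((x - centre) ** 2 for x in s) / len(s)
    params = (
        [1.0 / k] * k,
        [s[int((j + 0.5) * len(s) / k)] for j in range(k)],
        [max(spread, 1e-6)] * k,
    )
    for _ in range(iters):
        params = m_step(s, e_step(s, *params), k)
    return params


def cluster_via_sampling(data, sample_size, k=2, seed=0):
    """Fit the mixture on a random sample, then classify ALL points with one E step."""
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    params = fit_on_sample(sample, k)
    resp = e_step(data, *params)  # the single additional E step over the full data
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return params, labels
```

The expensive iterative fitting touches only `sample_size` points, while the full dataset is visited once. The refinements mentioned in the statement (several further EM steps on the full data, or new clusters for badly fitted observations) are not implemented here.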


“…The performance on noisy data has been demonstrated solely in [40] with data containing only 5% of noise, while we show results for varying noise proportions up to 90%. Third, as most clustering algorithms including SPC have time complexity of $O(n^2)$, the subsample size we consider here is $O(\sqrt{n})$, which is much smaller than the subsample sizes used in the previous work [2,11,40,12,21,26,30]. This is important in the context of big data applications and inherently large datasets, for which only algorithms with $O(n)$ operations would be computationally feasible.…”

Section: Introduction (mentioning)

confidence: 96%

“…Later, Fraley and Raftery [11] elaborate on subsample clustering and discriminant analysis for large data and discuss a modification of the simple random subsampling with the goal of finding small, tight clusters. A number of other clustering methods were subsequently developed, following a similar idea [25,40,12,21,26,30]. All of these methods are geared mainly towards computational efficiency, and several were also developed to find small clusters in large datasets [9,25,12,26,30].…”

Section: Introduction (mentioning)

confidence: 99%

“…We summarize the novel contributions of our work in comparison to existing subsampling-based clustering methods. First, while all of the existing methods assume that the number of clusters is given or require that some initial estimate of the number of clusters is provided by the user [9,25,30,2,11,40,12], ISSPC determines the number of clusters on its own through significance tests on a solution path. This improves usability and at the same time reduces user bias.…”

Section: Introduction (mentioning)

confidence: 99%


Iterative subsampling in solution path clustering of noisy big data

Marchetti

Zhou

2016

Statistics and Its Interface


We develop an iterative subsampling approach to improve the computational efficiency of our previous work on solution path clustering (SPC). The SPC method achieves clustering by concave regularization on the pairwise distances between cluster centers. This clustering method has the important capability to recognize noise and to provide a short path of clustering solutions; however, it is not sufficiently fast for big datasets. Thus, we propose a method that iterates between clustering a small subsample of the full data and sequentially assigning the other data points to attain orders of magnitude of computational savings. The new method preserves the ability to isolate noise, includes a solution selection mechanism that ultimately provides one clustering solution with an estimated number of clusters, and is shown to be able to extract small tight clusters from noisy data. The method's relatively minor losses in accuracy are demonstrated through simulation studies, and its ability to handle large datasets is illustrated through applications to gene expression datasets. An R package, SPClustering, for the SPC method with iterative subsampling is available at
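The complexity argument in the first citation statement for this paper can be spelled out in a short calculation. This is only a sketch: the constants $c$ and $a$, the number of clusters $K$, and the assumed per-point assignment cost are illustrative symbols that do not appear in the source. If clustering $m$ points costs about $c\,m^2$ operations and the subsample has $m = a\sqrt{n}$ points, then

$$
c\,m^2 = c\,(a\sqrt{n})^2 = c\,a^2\,n = O(n).
$$

If, in addition, each of the remaining $n - m$ points is assigned to one of $K$ clusters at cost $O(K)$, the total is

$$
O(n) + O\big((n - m)\,K\big) = O(n) \quad \text{for fixed } K,
$$

which is why a subsample of size $O(\sqrt{n})$ keeps a quadratic-time clustering method within a linear overall budget.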


“…Model-based clustering is an idiom that is often used to describe the application of a mixture model for clustering. Dating at least as far back as Wolfe (1963), interest in model-based clustering is increasing steadily in application areas such as food authenticity, social networks, and microarray gene expression analyses (e.g., Yeung et al, 2001; Wehrens et al, 2004; Krivitsky et al, 2009; McNicholas and Murphy, 2010). In model-based clustering applications, it is common to fit many mixture models within a family (cf.…”

Section: Introduction (mentioning)

confidence: 99%

Mixture model averaging for clustering

Wei

McNicholas

2014

Adv Data Anal Classif


In mixture model-based clustering applications, it is common to fit several models from a family and report clustering results from only the 'best' one. In such circumstances, selection of this best model is achieved using a model selection criterion, most often the Bayesian information criterion. Rather than throw away all but the best model, we average multiple models that are in some sense close to the best one, thereby producing a weighted average of clustering results. Two (weighted) averaging approaches are considered: averaging the component membership probabilities and averaging models. In both cases, Occam's window is used to determine closeness to the best model and weights are computed within a Bayesian model averaging paradigm. In some cases, we need to merge components before averaging; we introduce a method for merging mixture components based on the adjusted Rand index. The effectiveness of our model-based clustering averaging approaches is illustrated using a family of Gaussian mixture models on real and simulated data.
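The idea of averaging component membership probabilities over models inside Occam's window can be illustrated with a small sketch. Several details here are assumptions rather than the authors' procedure: BIC is taken in the larger-is-better form (twice the log-likelihood minus the penalty), model weights are approximated as proportional to exp(BIC/2), the window ratio of 20 is an arbitrary illustrative choice, and all models are assumed to have the same number of components with labels already aligned (the abstract notes that merging components is sometimes needed first; that step is not shown). The function names are hypothetical.

```python
import math


def occam_window_weights(bic, ratio=20.0):
    """Approximate BMA weights for the models inside Occam's window.

    bic: dict mapping model name -> BIC (larger is better, an assumption).
    ratio: models whose approximate posterior odds against the best model
           fall below 1/ratio are discarded (illustrative threshold).
    """
    best = max(bic.values())
    # Approximate posterior odds of each model relative to the best one.
    rel = {m: math.exp((b - best) / 2.0) for m, b in bic.items()}
    kept = {m: r for m, r in rel.items() if r >= 1.0 / ratio}
    total = sum(kept.values())
    return {m: r / total for m, r in kept.items()}


def average_memberships(memberships, weights):
    """Weighted average of membership probability matrices.

    memberships: dict mapping model name -> list of rows, one row of
                 component probabilities per observation.
    weights: dict from occam_window_weights; only these models are used.
    """
    models = list(weights)
    n = len(memberships[models[0]])
    g = len(memberships[models[0]][0])
    avg = [[0.0] * g for _ in range(n)]
    for m in models:
        for i, row in enumerate(memberships[m]):
            for j, p in enumerate(row):
                avg[i][j] += weights[m] * p
    return avg


# Usage with made-up BIC values and memberships for two observations.
bic = {"A": -1000.0, "B": -1002.0, "C": -1020.0}
z = {
    "A": [[1.0, 0.0], [0.2, 0.8]],
    "B": [[0.8, 0.2], [0.0, 1.0]],
    "C": [[0.5, 0.5], [0.5, 0.5]],
}
w = occam_window_weights(bic)
z_avg = average_memberships(z, w)
```

In this made-up example model C falls outside the window, so only A and B contribute to the averaged memberships.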
