Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes

Akhanlı, Serhat Emre; Hennig, Christian

doi:10.48550/arxiv.2002.01822

Cited by 2 publications

(2 citation statements)

References 0 publications


“…In all our examples, there is a "true" class label available, and we make use of these in evaluating the clustering methods we consider. However, in most cases in practice there are no true class labels, and even if there are such labels recovering them may not be the purpose of a cluster analysis (Akhanli and Hennig, 2020). As we have emphasized, a main advantage of our method is the ability to specify what aspects of the data define clusters through the choice of random effects used in defining mixed predictive replicates.…”

Section: Datasets (mentioning)

confidence: 99%

Bayesian clustering using random effects models and predictive projections

Mao, Nott (2021)

Preprint


Linear mixed models are widely used for analyzing hierarchically structured data involving missingness and unbalanced study designs. We consider a Bayesian clustering method that combines linear mixed models and predictive projections. For each observation, we consider a predictive replicate in which only a subset of the random effects is shared between the observation and its replicate, with the remainder being integrated out using the conditional prior. Predictive projections are then defined in which the number of distinct values taken by the shared random effects is finite, in order to obtain different clusters. Integrating out some of the random effects acts as a noise filter, allowing the clustering to be focused on only certain chosen features of the data. The method is inspired by methods for Bayesian model checking, in which simulated data replicates from a fitted model are used for model criticism by examining their similarity to the observed data in relevant ways. Here the predictive replicates are used to define similarity between observations in relevant ways for clustering. To illustrate the way our method reveals aspects of the data at different scales, we consider fitting temporal trends in longitudinal data using Fourier cosine bases with a random effect for each basis function, and different clusterings defined by shared random effects for replicates of low or high frequency terms. The method is demonstrated in a series of real examples.


“…Previous work, including [9], could not conclude superiority of any single validity index. Thus, to obtain a comprehensive understanding of the difficulty of a given dataset, many cluster validity indices could potentially be used in combination [42]. In the context of a fitness function, this could be done in the form of an aggregation of indices or through formulation as a many-objective problem.…”

Section: Computing the Fitness Of A Dataset (mentioning)

confidence: 99%

HAWKS: Evolving Challenging Benchmark Sets for Cluster Analysis

Shand, Allmendinger, Handl et al. (2021)

Preprint


Comprehensive benchmarking of clustering algorithms is rendered difficult by two key factors: (i) the elusiveness of a unique mathematical definition of this unsupervised learning approach and (ii) dependencies between the generating models or clustering criteria adopted by some clustering algorithms and indices for internal cluster validation. Consequently, there is no consensus regarding the best practice for rigorous benchmarking, and whether this is possible at all outside the context of a given application. Here, we argue that synthetic datasets must continue to play an important role in the evaluation of clustering algorithms, but that this necessitates constructing benchmarks that appropriately cover the diverse set of properties that impact clustering algorithm performance. Through our framework, HAWKS, we demonstrate the important role evolutionary algorithms play to support flexible generation of such benchmarks, allowing simple modification and extension. We illustrate two possible uses of our framework: (i) the evolution of benchmark data consistent with a set of hand-derived properties and (ii) the generation of datasets that tease out performance differences between a given pair of algorithms. Our work has implications for the design of clustering benchmarks that sufficiently challenge a broad range of algorithms, and for furthering insight into the strengths and weaknesses of specific approaches.
