Clustering in the Presence of Scatter

Maitra, Ranjan; Ramler, Ivan

doi:10.1111/j.1541-0420.2008.01064.x

Cited by 20 publications

(27 citation statements)

References 33 publications

“…The k-means algorithm does not make distributional assumptions but may be cast in a semi-parametric framework [12], [41]. We now show that the following holds even without the Gaussian distributional assumptions that underlie Result 1:…”

Section: Introduction (mentioning)

confidence: 70%

An efficient k‐means‐type algorithm for clustering datasets with incomplete records

Lithio

Maitra

2018

Statistical Analysis

Self Cite

The k-means algorithm is arguably the most popular nonparametric clustering method but cannot generally be applied to datasets with incomplete records. The usual practice then is to either impute missing values under an assumed missing-completely-at-random mechanism or to ignore the incomplete records, and apply the algorithm on the resulting dataset. We develop an efficient version of the k-means algorithm that allows for clustering in the presence of incomplete records. Our extension is called k_m-means and reduces to the k-means algorithm when all records are complete. We also provide initialization strategies for our algorithm and methods to estimate the number of groups in the dataset. Illustrations and simulations demonstrate the efficacy of our approach in a variety of settings and patterns of missing data. Our methods are also applied to the analysis of activation images obtained from a functional Magnetic Resonance Imaging experiment.

Index Terms: AMELIA, CARP, FMRI, IMPUTATION, JUMP STATISTIC, k-MEANS++, k-POD, MICE, SOFT CONSTRAINTS, SDSS

arXiv:1802.08363v2 [stat.ML] 8 Sep 2018

… algorithm using partial distances. However, the repeated application of k-means at every iteration is computationally expensive. The literature is also sparse on estimating the number of groups K for data with incomplete records. This paper develops an efficient k-means-type clustering algorithm called k_m-means that accommodates incomplete records and generalizes the algorithm of [38] that is popular in the statistical literature and software. Expressions for the objective function and its changes following the cluster reassignment of an observation play central roles in our generalization of the [38] algorithm. Section II also provides an initialization strategy for k_m-means and an adaptation of the jump statistic [39] for estimating the number of groups.
Section III comprehensively evaluates our methodology through a series of large-scale simulation experiments for datasets of different clustering complexities, sizes, numbers of groups, and with different missingness mechanisms and proportions. Section IV uses our methods to find the types of activated cerebral regions from several single-task functional Magnetic Resonance Imaging (fMRI) experiments. We conclude with some discussion in Section V. This paper also has an online supplement with additional illustrations on performance evaluations and other preliminary data analysis. Figures in the supplement referred to in this paper have the prefix "S-".

“…These drawbacks have been tackled by more recent methods. Maitra and Ramler (2009), e.g., proposed a generalization of the k-means algorithm that explicitly considers scattered points. Some sophisticated grouping algorithms were proposed that only require specifying, e.g., a maximal cluster size (Scharl and Leisch, 2006), a minimal cluster size (Manley et al, 2008, relying on point trajectories over time) or both a minimal cluster size and an effective maximal cluster radius (Ester et al, 1996; Ankerst et al, 1999).…”

Section: Introduction (mentioning)

confidence: 99%

A Bayesian mixture model to quantify parameters of spatial clustering

Schäfer

Radon

Klein

et al. 2015

Computational Statistics & Data Analysis

“…Popular methods include hierarchical clustering [Eisen et al (1998)], K-means [Dudoit and Fridlyand (2002)], mixture model-based approaches [Xie, Pan and Shen (2008); McLachlan, Bean and Peel (2002)] and nonparametric approaches [Qin (2006)], for analysis of single transcriptomic study. Resampling and ensemble methods have been used to improve stability of the clustering analysis [Kim et al (2009); Swift et al (2004)] or to pursue tight clusters by leaving scattered samples that are different from major clusters [Tseng (2007); Tseng and Wong (2005); Maitra and Ramler (2009)]. Witten and Tibshirani (2010) proposed a sparse K-means algorithm that can effectively select gene features and perform sample clustering simultaneously.…”

Section: Introduction (mentioning)

confidence: 99%

Integrative sparse $K$-means with overlapping group lasso in genomic applications for disease subtype discovery

Huo

Tseng

2017

Ann. Appl. Stat.

Cancer subtypes discovery is the first step to deliver personalized medicine to cancer patients. With the accumulation of massive multi-level omics datasets and established biological knowledge databases, omics data integration with incorporation of rich existing biological knowledge is essential for deciphering a biological mechanism behind the complex diseases. In this manuscript, we propose an integrative sparse K-means (is-K means) approach to discover disease subtypes with the guidance of prior biological knowledge via sparse overlapping group lasso. An algorithm using an alternating direction method of multiplier (ADMM) will be applied for fast optimization. Simulation and three real applications in breast cancer and leukemia will be used to compare is-K means with existing methods and demonstrate its superior clustering accuracy, feature selection, functional annotation of detected molecular features and computing efficiency.
