Bayesian Variable Selection in Clustering High-Dimensional Data

Tadesse, Mahlet G.; Sha, Naijun; Vannucci, Marina

doi:10.1198/016214504000001565

Cited by 215 publications

(222 citation statements)

References 24 publications

Supporting

Mentioning

220

Contrasting


“…Data-clustering algorithms have been developed for more than half a century (1). Significant advances in the last two decades include spectral clustering (2-4), generalizations of classic center-based methods (5, 6), mixture models (7, 8), mean shift (9), affinity propagation (10), subspace clustering (11-13), nonparametric methods (14, 15), and feature selection (16-20). Despite these developments, no single algorithm has emerged to displace the k-means scheme and its variants (21). This is despite the known drawbacks of such center-based methods, including sensitivity to initialization, limited effectiveness in high-dimensional spaces, and the requirement that the number of clusters be set in advance.…”

mentioning

confidence: 99%


Robust continuous clustering

Shah

Koltun

2017

Proc. Natl. Acad. Sci. U.S.A.

159

106


Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, handwritten digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank.

clustering | data analysis | unsupervised learning

Clustering is one of the fundamental experimental procedures in data analysis. It is used in virtually all natural and social sciences and has played a central role in biology, astronomy, psychology, medicine, and chemistry. Data-clustering algorithms have been developed for more than half a century (1). Significant advances in the last two decades include spectral clustering (2-4), generalizations of classic center-based methods (5, 6), mixture models (7, 8), mean shift (9), affinity propagation (10), subspace clustering (11-13), nonparametric methods (14, 15), and feature selection (16-20).

Despite these developments, no single algorithm has emerged to displace the k-means scheme and its variants (21). This is despite the known drawbacks of such center-based methods, including sensitivity to initialization, limited effectiveness in high-dimensional spaces, and the requirement that the number of clusters be set in advance. The endurance of these methods is in part due to their simplicity and in part due to difficulties associated with some of the new techniques, such as additional hyperparameters that need to be tuned, high computational cost, and varying effectiveness across domains. Consequently, scientists who analyze large high-dimensional datasets with unknown distribution must maintain and apply multiple different clustering algorithms in the hope that one will succeed. Books have been written to guide practitioners through the landscape of data-clustering techniques (22).

We present a clustering algorithm that is fast, easy to use, and effective in high dimensions. The algorithm optimizes a clear continuous objective, using standard numerical methods that scale to massive datasets. The number of clusters need not be known in advance.

The operation of the algorithm ...

“…Following ref. 36, we assign larger probabilities to the merges of similar branches, such that where S is the total number of branches, and S 1 is the number of nonempty branches. Hitherto, we completely specified the branch splitting move of the RJMCMC (see SI Appendix, Text 3 for other moves).…”

Section: Results

mentioning

confidence: 99%

“…Following ref. 36, we denote the transition probability of this move as $q(\theta_{\text{new}} \mid \theta_{\text{old}})$, and assign…”

Section: Results

mentioning

confidence: 99%

Time-variant clustering model for understanding cell fate decisions

Huang

Cao

Biase

et al.

2014

Proc. Natl. Acad. Sci. U.S.A.


Both spatial characteristics and temporal features are often the subjects of concern in physical, social, and biological studies. This work tackles the clustering problems for time course data in which the cluster number and clustering structure change with respect to time, dubbed time-variant clustering. We developed a hierarchical model that simultaneously clusters the objects at every time point and describes the relationships of the clusters between time points. The hidden layer of this model is a generalized form of branching processes. A reversible-jump Markov Chain Monte Carlo method was implemented for model inference, and a feature selection procedure was developed. We applied this method to explore an open question in preimplantation embryonic development. Our analyses using single-cell gene expression data suggested that the earliest cell fate decision could start at the 4-cell stage in mice, earlier than the commonly thought 8- to 16-cell stage. These results together with independent experimental data from single-cell RNA-seq provided support against a prevailing hypothesis in mammalian development.

clustering | time | branching process | embryonic development | cell fate


“…This is attractive, for example, when the number of genes on a microarray is the relevant sample size, thus allowing flexible semi-parametric representations. Such approaches are discussed, among others, in Broet P (2002), Dahl (2003), or Tadesse et al (2005). The latter exploit the clustering implicitly defined by the mixture model.…”

Section: Introduction

mentioning

confidence: 99%