2022 · Preprint
DOI: 10.48550/arxiv.2203.15267

Selective inference for k-means clustering

Abstract: We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. To overcome this problem, we take a selective inference approach. We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering, and show that it can be efficiently computed. We apply our pr…
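The abstract's central claim, that classical tests applied after k-means clustering inflate the Type I error rate, is easy to see in simulation. Below is a minimal sketch (ours, not from the paper), assuming numpy, scipy, and scikit-learn: data are drawn from a single Gaussian with no true clusters, k-means is run with k = 2, and a naive Welch t-test compares the two clusters along their estimated mean-difference direction.

```python
# Minimal simulation (illustrative sketch, not the paper's method):
# under a global null with no true clusters, a naive two-sample t-test
# between k-means clusters rejects far more often than the nominal level,
# because the clusters were chosen to have well-separated means.
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, p, n_sim, alpha = 100, 2, 500, 0.05
rejections = 0
for _ in range(n_sim):
    X = rng.standard_normal((n, p))  # one Gaussian blob: no true difference in means
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    a, b = X[labels == 0], X[labels == 1]
    d = a.mean(axis=0) - b.mean(axis=0)
    d /= np.linalg.norm(d)           # unit vector along the estimated mean difference
    _, pval = stats.ttest_ind(a @ d, b @ d, equal_var=False)  # naive Welch t-test
    rejections += pval < alpha
print(f"Empirical Type I error of the naive test: {rejections / n_sim:.2f} "
      f"(nominal level: {alpha})")
```

On runs like this the empirical rejection rate is close to 1 rather than 0.05, which is exactly the inflation the paper's selective p-value is designed to correct.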

Cited by 3 publications (4 citation statements)
References: 61 publications
“…Whilst our approach has been developed for changepoint problems, the general idea can be applied to other scenarios such as clustering (Gao et al, 2022; Chen and Witten, 2022) or regression trees (Neufeld et al, 2022). For example, current methods for post-selection inference after clustering are based on a test statistic that compares the means of the clusters, fixing the projection of the data that is orthogonal to this.…”
Section: Discussion
confidence: 99%
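The "fixing the projection" construction this quote alludes to can be written schematically (notation ours, following the general shape of the selective p-values in Gao et al., 2022 and Chen and Witten, 2022, not a formula reproduced from either paper):

```latex
p(\mathbf{x}) \;=\; \mathbb{P}_{H_0}\!\Big(
  \big\| \bar{X}_{\hat{C}_1} - \bar{X}_{\hat{C}_2} \big\|_2
  \;\ge\;
  \big\| \bar{\mathbf{x}}_{\hat{C}_1} - \bar{\mathbf{x}}_{\hat{C}_2} \big\|_2
  \;\Big|\;
  \hat{C}_1, \hat{C}_2 \in \mathcal{C}(X),\;
  \Pi_{\nu}^{\perp} X = \Pi_{\nu}^{\perp} \mathbf{x}
\Big)
```

Here \mathcal{C}(\cdot) is the (fully specified) clustering procedure, \hat{C}_1 and \hat{C}_2 are the two clusters under comparison, and \Pi_{\nu}^{\perp} projects onto the subspace orthogonal to the contrast \nu defining the mean difference; conditioning on \Pi_{\nu}^{\perp} X is the "fixing the projection of the data that is orthogonal to this" step the quote describes.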
“…We next consider taking a selective inference approach (Fithian and others, 2014;Lee and others, 2016;Taylor and Tibshirani, 2015;Gao and others, 2020;Chen and Witten, 2022) to correct the p-values in (3.5). This involves fitting the same regression model as the naive method, but replacing (3.5) with the conditional probability…”
Section: Selective Inference Through Conditioning
confidence: 99%
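Equation (3.5) of the citing paper is not reproduced on this page, so only the generic shape of the correction can be shown. Schematically, where the naive approach computes \mathbb{P}_{H_0}(|T(X)| \ge |T(\mathbf{x})|) for a test statistic T, the selective approach replaces it with a probability conditioned on the selection event:

```latex
p_{\mathrm{sel}}(\mathbf{x}) \;=\;
\mathbb{P}_{H_0}\big( |T(X)| \ge |T(\mathbf{x})| \;\big|\; X \in \mathcal{S} \big),
\qquad
\mathcal{S} = \{ \text{datasets for which the same hypothesis would be selected} \}
```

Conditioning on \mathcal{S} accounts for the fact that the hypothesis was chosen by looking at the data, which is what restores Type I error control.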
“…The main idea is as follows: to test a hypothesis generated from the data, we should condition on the event that we selected this particular hypothesis. Despite promising applications of this framework to a number of problems, such as inference after regression (Lee et al 2016), changepoint detection (Jewell et al 2022, Hyun et al 2021), clustering (Gao et al 2022, Chen & Witten 2022, Yun & Barber 2023), and outlier detection (Chen & Bien 2020), it suffers from some drawbacks: 1. To perform selective inference, the procedure used to generate the null hypothesis must be fully specified in advance.…”
Section: Introduction
confidence: 99%
“…To perform selective inference, the procedure used to generate the null hypothesis must be fully-specified in advance. For instance, if a researcher wishes to cluster the data and then test for a difference in means between the clusters, as in Gao et al (2022) and Chen & Witten (2022), then they must fully specify the clustering procedure (e.g., hierarchical clustering with squared Euclidean distance and complete linkage, cut to obtain K clusters) in advance.…”
Section: Introduction
confidence: 99%
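The "fully specified in advance" requirement in this quote is concrete: every choice that affects which clusters come out, including the distance, the linkage, and the number of clusters K, must be fixed before looking at the data. A minimal sketch of such a pre-committed pipeline, assuming scipy (function name ours, for illustration):

```python
# Sketch of a fully pre-specified clustering pipeline, as the quote requires:
# hierarchical clustering with squared Euclidean distance and complete linkage,
# cut to obtain K clusters. All choices are fixed before the data are seen.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

K = 3  # number of clusters, committed to in advance

def cluster_fully_specified(X: np.ndarray) -> np.ndarray:
    """Complete-linkage hierarchical clustering on squared Euclidean distances."""
    d = pdist(X, metric="sqeuclidean")             # pairwise squared Euclidean distances
    Z = linkage(d, method="complete")              # complete linkage on the condensed distances
    return fcluster(Z, t=K, criterion="maxclust")  # cut the dendrogram into K clusters

labels = cluster_fully_specified(np.random.default_rng(1).standard_normal((50, 5)))
```

Any data-driven deviation from this recipe, such as choosing K after inspecting the dendrogram, falls outside what the selective inference frameworks cited above can condition on.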