2011
DOI: 10.2172/1025410

Sampling Within k-Means Algorithm to Cluster Large Datasets

Abstract: Due to current data collection technology, our ability to gather data has surpassed our ability to analyze it. In particular, k-means, one of the simplest and fastest clustering algorithms, is ill-equipped to handle extremely large datasets on even the most powerful machines. Our new algorithm uses a sample from a dataset to decrease runtime by reducing the amount of data analyzed. We perform a simulation study to compare our sampling-based k-means to the standard k-means algorithm by analyzing both the speed …
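
The general idea of clustering a sample rather than the full dataset can be sketched as follows. This is a minimal illustration only, not the report's algorithm: the truncated abstract does not give the sampling scheme, sample size, or stopping rule, so the function names (`lloyd`, `sampled_kmeans`), the simple random sampling, the `sample_size` parameter, and the final step of assigning every point to its nearest sample-derived center are all assumptions.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two points given as tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    # Index of the center closest to the point.
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def lloyd(points, k, iters, rng):
    # Standard k-means (Lloyd) iterations, started from k random points.
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        new_centers = []
        for j, g in enumerate(groups):
            if g:
                new_centers.append(tuple(sum(c) / len(g) for c in zip(*g)))
            else:
                new_centers.append(centers[j])  # empty cluster: keep old center
        if new_centers == centers:
            break
        centers = new_centers
    return centers

def sampled_kmeans(points, k, sample_size, iters=50, seed=0):
    # Assumption: a simple random sample stands in for the full dataset.
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = lloyd(sample, k, iters, rng)
    # One pass over all points to label them with the sample-derived centers.
    labels = [nearest(p, centers) for p in points]
    return centers, labels

# Synthetic data: two well-separated Gaussian blobs (illustrative only).
gen = random.Random(1)
data = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(5000)]
data += [(gen.gauss(10, 1), gen.gauss(10, 1)) for _ in range(5000)]
centers, labels = sampled_kmeans(data, k=2, sample_size=200)
```

In this sketch the iterative part runs on 200 points instead of 10,000, and the full dataset is touched only once, in the final labeling pass. Whether the resulting centers are close enough to those of standard k-means depends on the sample size and the data.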

Cited by 17 publications (13 citation statements)
References 2 publications
“…Making use of the whole observed data into the clustering process, the behavior of the k-means can be considered as an exhaustive analysis, in other words, the clustering of the whole unit measurements may occupy high time and memory consuming, while a small number of sampling units can be practically fast, easily and accurately representative of the whole observed population. In the state of art, few authors used random sampling in order to avoid the use of the whole set of available data [3,10,11]. Among them, Bradley [11] has proposed an approach based on multiple small random sub-samples to estimate refined initial centers for the k-means clustering.…”
Section: Srs-k-means (mentioning)
confidence: 99%
“…We show in this paper the effect of the sampling procedures in the clustering process. Simple random sampling (SRS) is the mostly used procedure in which the data points are assumed to be iid [9] and there are only a few results available when the sampling design is different [10,11]. However, in some applications, such as the one explained in [3,12], using ranked set sampling (RSS), may be cheaper and result in better and more informative samples from the population.…”
Section: Introduction (mentioning)
confidence: 99%
“…The behaviour "in the limit" is of practical relevance, as some researchers, especially in the realm of database mining, propose to cluster "sufficiently large" samples instead of the whole database content. See for example Bejarano et al [6] on k-means accelerating via subsampling.…”
Section: Introduction (mentioning)
confidence: 99%
“…These results also indicate that in tourism research the sample sizes are at best modest and there is no need to employ subsampling strategies to reduce the computational burden in the segmentation analysis due to large data sets, as suggested for other areas of research where millions of observations are available (cf. Bejarano et al 2011). Dolnicar's (2002) results indicate that despite market segmentation being used extensively in the field of tourism research, the fundamental question of how many variables should be used for a certain number of respondents has not yet been explicitly considered, and practically no guidance is available to data analysts with respect to the sample size required.…”
Section: Introduction (mentioning)
confidence: 99%