Sampling Within k-Means Algorithm to Cluster Large Datasets

Bejarano, Jeremy; Bose, Koushiki; Brannan, Tyler; Thomas, Anita; Adragni, Kofi P.; Neerchal, Nagaraj K.; Ostrouchov, George

doi:10.2172/1025410

Cited by 17 publications

(13 citation statements)

References 2 publications


“…Making use of the whole observed data into the clustering process, the behavior of the k-means can be considered as an exhaustive analysis, in other words, the clustering of the whole unit measurements may occupy high time and memory consuming, while a small number of sampling units can be practically fast, easily and accurately representative of the whole observed population. In the state of art, few authors used random sampling in order to avoid the use of the whole set of available data [3,10,11]. Among them, Bradley [11] has proposed an approach based on multiple small random sub-samples to estimate refined initial centers for the k-means clustering.…”

Section: Srs-k-means (mentioning)

confidence: 99%
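
The idea in the statement above, clustering a small random sample and then labelling the whole population, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Bejarano et al. or of the citing paper: the function names (`assign`, `kmeans`, `sampled_kmeans`), the sample size of 200, and the synthetic two-blob data are all hypothetical choices, and the clustering step is plain Lloyd's algorithm written with NumPy only.

```python
import numpy as np

def assign(points, centers):
    """Return the index of the nearest center for every point."""
    # Squared Euclidean distances, shape (n_points, n_centers).
    # For very large data this should be done in chunks to limit memory.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise with k distinct points drawn at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centers)
        # Recompute each center; keep the old one if its cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(points, centers)

def sampled_kmeans(data, k, sample_size, seed=0):
    """Fit centers on a simple random sample, then label all of `data`."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=sample_size, replace=False)
    centers, _ = kmeans(data[idx], k, seed=seed)
    return centers, assign(data, centers)

# Hypothetical data: two well-separated Gaussian blobs, 10,000 points in total.
rng = np.random.default_rng(1)
data = np.vstack([
    rng.normal(0.0, 0.5, size=(5000, 2)),
    rng.normal(5.0, 0.5, size=(5000, 2)),
])
centers, labels = sampled_kmeans(data, k=2, sample_size=200)
```

Only the 200 sampled points enter the iterative step; the full dataset is touched once, in the final assignment pass. How large the sample must be for the centers to be trustworthy depends on the data and is not settled by this sketch.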

“…We show in this paper the effect of the sampling procedures in the clustering process. Simple random sampling (SRS) is the mostly used procedure in which the data points are assumed to be iid [9] and there are only a few results available when the sampling design is different [10,11]. However, in some applications, such as the one explained in [3,12], using ranked set sampling (RSS), may be cheaper and result in better and more informative samples from the population.…”

Section: Introduction (mentioning)

confidence: 99%


Ranked k-means clustering for terahertz image segmentation

Ayech

Ziou

2015

2015 IEEE International Conference on Image Processing (ICIP)


It is known that k-means clustering is especially sensitive to initial starting centers. In this paper, we propose an original version of k-means for the segmentation of Terahertz images, called ranked-k-means, which is essentially less sensitive to the initialization of the centers. We present the ranked set sampling design and explain how to reformulate the k-means technique under the ranked sample to estimate the expected centers as well as the clustering of the observed data. Our clustering approach is tested on various Terahertz images. Experimental results show that k-means based on the ranked sample is more efficient than other clustering techniques.


“…The behaviour "in the limit" is of practical relevance, as some researchers, especially in the realm of database mining, propose to cluster "sufficiently large" samples instead of the whole database content. See for example Bejarano et al [6] on k-means accelerating via subsampling.…”

Section: Introduction (mentioning)

confidence: 99%

On the Consistency of k-means++ algorithm

Kłopotek

2020


We prove in this paper that the expected value of the objective function of the k-means++ algorithm for samples converges to population expected value. As k-means++, for samples, provides with constant factor approximation for k-means objectives, such an approximation can be achieved for the population with increase of the sample size. This result is of potential practical relevance when one is considering using subsampling when clustering large data sets (large data bases).


“…These results also indicate that in tourism research the sample sizes are at best modest and there is no need to employ subsampling strategies to reduce the computational burden in the segmentation analysis due to large data sets, as suggested for other areas of research where millions of observations are available (cf. Bejarano et al 2011). Dolnicar's (2002) results indicate that despite market segmentation being used extensively in the field of tourism research, the fundamental question of how many variables should be used for a certain number of respondents has not yet been explicitly considered, and practically no guidance is available to data analysts with respect to the sample size required.…”

Section: Introduction (mentioning)

confidence: 99%

Required Sample Sizes for Data-Driven Market Segmentation Analyses in Tourism

Dolničar

Grün

Leisch

et al. 2013

Journal of Travel Research


Data analysts in industry and academia make heavy use of market segmentation analysis to develop tourism knowledge and select commercially attractive target segments. Within academic research alone, approximately 5% of published articles use market segmentation. However, the validity of data-driven market segmentation analyses depends on having available a sample of adequate size. Moreover, no guidance exists for determining what an adequate sample size is. In the present simulation study using artificial data of known structure, the impact of the difficulty of the segmentation task on the required sample size is analyzed in dependence of the number of variables in the segmentation base. Under all simulated data circumstances, a sample size of 70 times the number of variables proves to be adequate. This finding is of substantial practical importance because it will provide guidance to data analysts in academia and industry who wish to conduct reliable and valid segmentation studies.
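
The rule of thumb reported in this abstract (a sample size of 70 times the number of segmentation variables) is simple arithmetic and can be written as a small helper. The function name `required_sample_size` and the example of 10 variables are hypothetical; only the factor of 70 comes from the abstract.

```python
def required_sample_size(n_variables, factor=70):
    """Sample size suggested by the '70 times the number of variables' rule.

    The default factor of 70 is the value quoted in the abstract; everything
    else about this helper is an illustrative assumption.
    """
    if n_variables < 1:
        raise ValueError("n_variables must be at least 1")
    return factor * n_variables

# A segmentation base of 10 variables would call for 70 * 10 respondents.
print(required_sample_size(10))  # prints 700
```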
