How slow is the k-means method?

Arthur, David; Vassilvitskii, Sergei

doi:10.1145/1137856.1137880

Cited by 483 publications

(462 citation statements)

References 12 publications

Mentioning: 452

“…The optimum number of the clusters may vary based on the properties of the dataset, such as the geometric distribution, statistical measures, and neighborhood measures [42,48]. In general, though, it can be reported that the increase in the number of clusters yields higher computational costs with lower condensing ratio and may also cause higher classification accuracy.…”

Section: Batch Datasets (mentioning)

confidence: 99%

A novel approach for extracting ideal exemplars by clustering for massive time-ordered datasets

Ertuğrul¹

2017

Turk J Elec Eng & Comp Sci

Abstract: The number and length of massive datasets have increased day by day, and this makes the machine learning stages more complex because of the high computational costs. To decrease the computational cost, many methods have been proposed in the literature, such as data condensing, feature selection, and filtering. Although clustering methods are generally employed to divide samples into groups, another way of condensing data is to determine ideal exemplars (or prototypes), which can be used instead of the whole dataset. In this study, the efficiency of the traditional approach of data condensing by clustering was first confirmed according to the accuracies and condensing ratios obtained on 9 different synthetic or real batch datasets. This approach was then improved so that it could be employed on time-ordered datasets. To validate the proposed approach, 23 different real time-ordered datasets were used in experiments. The mean RMSEs achieved were 0.27 with the condensed datasets (mean condensed ratio was 97.17%) and 0.29 with the whole datasets. The results showed that higher accuracy rates and condensing ratios were achieved by the proposed approach.

“…Very few theoretical guarantees are known about Lloyd's method or its variants. The convergence rate of Lloyd's method has recently been investigated in [10,22,2] and in particular, [2] shows that Lloyd's method can require a superpolynomial number of iterations to converge.…”

Section: Introduction (mentioning)

confidence: 99%

The Effectiveness of Lloyd-Type Methods for the k-Means Problem

Ostrovsky, Rabani, Schulman³ et al. 2006

2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06)

We investigate variants of Lloyd's heuristic for clustering high dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
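The iteration whose running time these statements discuss can be sketched as follows. This is a minimal, standard-library-only illustration of the basic Lloyd iteration, not code from any of the papers listed here; the names `lloyd`, `sq_dist`, and `mean` are hypothetical, and the starting centers are supplied by the caller rather than chosen by any particular seeding process. The returned iteration count is the quantity that worst-case analyses bound.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def lloyd(points, centers, max_iter=1000):
    """Run Lloyd's iteration from caller-supplied starting centers.

    Returns (centers, assignment, iterations), where `iterations` counts
    the rounds in which the assignment of points to centers changed.
    """
    centers = [tuple(c) for c in centers]
    assignment = None
    iterations = 0
    while iterations < max_iter:
        # Assignment step: each point goes to its nearest current center
        # (ties are broken in favor of the lowest center index).
        new_assignment = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the method has converged
        assignment = new_assignment
        iterations += 1
        # Update step: move each center to the centroid of its cluster.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean(members)
    return centers, assignment, iterations


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
    print(lloyd(pts, [(0, 0), (10, 0)]))
```

On well-separated inputs like the toy example, the loop stops after very few rounds; the superpolynomial behavior mentioned in the quote above arises only on carefully constructed instances.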

“…For clustering the tweets, we use the K-means algorithm [23], a very popular algorithm for clustering due to its speed and simplicity [24,25]. Basically, it has a single parameter to set: k, the number of clusters to find.…”

Section: Topic Identification Based On Clustering (mentioning)

confidence: 99%

Topic Identification and Categorization of Public Information in Community-Based Social Media

Kusumawardani

Basri

2017

J. Phys.: Conf. Ser.

Abstract: This paper presents work on a semi-supervised method for topic identification and classification of short texts in social media, and its application to tweets containing dialogues in a large community of city dwellers, written mostly in Indonesian. These dialogues comprise a wealth of information about the city, shared in real time. We found that despite the high irregularity of the language used, and the scarcity of suitable linguistic resources, a meaningful identification of topics could be performed by clustering the tweets using the K-Means algorithm. The resulting clusters are found to be robust enough to be the basis of a classification. On three grouping schemes derived from the clusters, we get accuracies of 95.52%, 95.51%, and 96.7% using linear SVMs, reflecting the applicability of this method for topic identification and classification on such data.
