Proceedings of the 7th ACM International Conference on Web Search and Data Mining 2014
DOI: 10.1145/2556195.2556260
Scalable K-Means by ranked retrieval

Abstract: The k-means clustering algorithm has a long history and proven practical performance; however, it does not scale to clustering millions of data points into thousands of clusters in high-dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, a prohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Us…
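To make the bottleneck in the abstract concrete, here is a minimal sketch of standard (exact) k-means. The assignment step costs O(n·k·d) per iteration, which is exactly the term the paper's ranked-retrieval adaptation targets. This is an illustrative baseline, not the paper's algorithm; function names are my own.

```python
import numpy as np

def kmeans_assign(points, centroids):
    """Exact assignment step: find each point's nearest centroid.

    This is the O(n * k * d) bottleneck: with millions of points (n),
    thousands of clusters (k), and high dimension (d), recomputing it
    at every iteration is prohibitively expensive.
    """
    # Pairwise squared distances via broadcasting -> shape (n, k)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, k, iters=10, seed=0):
    """Lloyd's algorithm with random initialization (illustrative only)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        labels = kmeans_assign(points, centroids)
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

Every iteration pays the full n·k distance matrix; the techniques cited below avoid exactly that recomputation.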

Cited by 52 publications (47 citation statements)
References 28 publications
“…Comparison with IQ-means: IQ-means is an accelerated version of ranked-retrieval [8] that skips distance computations when vectors are placed far away from centers. IQ-means can be the fastest clustering method for large-scale data.…”
Section: Discussion
confidence: 99%
“…Such approximated k-means methods include approximated search [31], hierarchical search [27], approximated bounds [38], and batch-based methods [26,34]. If the size of the input data is large, subset-based methods [2,8] can achieve the fastest performance. These methods only treat a subset of the input vectors (i.e., vectors close to each center), making the computation efficient.…”
Section: Related Work
confidence: 99%
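The quote above describes methods that skip distance computations for vectors far from a center. A minimal sketch of that pruning idea, using the classic half-distance bound (if a point is within half the distance from its center to any other center, it cannot switch clusters), is shown below. This is an illustrative stand-in for the general idea, not the actual ranked-retrieval or IQ-means algorithm; all names are hypothetical.

```python
import numpy as np

def assign_with_pruning(points, centroids, labels, dist_to_assigned):
    """One assignment pass that skips full distance computations for
    points provably unable to change cluster.

    Uses the half-distance bound: if d(x, c) <= s(c)/2, where s(c) is
    the distance from center c to its nearest other center, then x
    stays assigned to c and no other distances need computing.
    """
    # s[c] = distance from center c to its nearest other center
    cc = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=2)
    np.fill_diagonal(cc, np.inf)
    s = cc.min(axis=1)

    skipped = 0
    for i, x in enumerate(points):
        if dist_to_assigned[i] <= 0.5 * s[labels[i]]:
            skipped += 1  # provably cannot switch; skip all k distances
            continue
        d = np.linalg.norm(centroids - x, axis=1)
        labels[i] = d.argmin()
        dist_to_assigned[i] = d[labels[i]]
    return labels, skipped
```

When clusters are well separated, most points satisfy the bound and the per-iteration cost drops far below n·k distance evaluations.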
“…The related research includes SNOB [62], MCLUST [63], k-medoids, and k-means related research [64,65]. Density-based partitioning methods attempt to discover low-dimensional data, which is dense-connected, known as spatial data.…”
Section: Clustering Algorithms
confidence: 99%
“…In recent years, continuous efforts have been devoted to looking for effective solutions that are still workable in webscale data. Representative works are [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. However, most of the k-means variants achieve high speed efficiency while sacrificing the clustering quality.…”
Section: Introduction
confidence: 99%