Duong Van Hieu scite author profile

Abstract. Clustering very large datasets is a challenging problem in data mining and processing. MapReduce is considered a powerful programming framework that significantly reduces execution time by dividing a job into several tasks and executing them in a distributed environment. K-Means is one of the most widely used clustering methods, and K-Means based on MapReduce is considered an advanced solution for clustering very large datasets. However, execution time remains an obstacle because the number of iterations grows as the dataset size and the number of clusters increase. This paper presents a new approach for reducing the number of iterations of the K-Means algorithm, which can be applied to very large dataset clustering. The new method can reduce the number of iterations by up to 30 percent while maintaining up to 98 percent accuracy when tested on several very large datasets with attributes of real data type. Based on these experimental results, this paper proposes a new fast K-Means clustering method for very large datasets based on MapReduce combined with a new cutting method (abbreviated to FMR.K-Means).
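As a rough illustration of why the iteration count matters, the sketch below runs K-Means as a sequence of map and reduce passes over the data, one pass per iteration. It is a single-process stand-in for a MapReduce job, not the authors' implementation: the function names (`map_phase`, `reduce_phase`, `kmeans`) and the stopping tolerance are assumptions made for the example, and the cutting method is not implemented because the abstract does not describe it.

```python
from collections import defaultdict
from math import dist


def map_phase(points, centroids):
    """Map step: emit (index of nearest centroid, (point, 1)) for each point."""
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield idx, (p, 1)


def reduce_phase(pairs, centroids):
    """Reduce step: average the points assigned to each centroid index."""
    sums, counts = {}, defaultdict(int)
    for idx, (p, n) in pairs:
        acc = sums.setdefault(idx, [0.0] * len(p))
        for d, v in enumerate(p):
            acc[d] += v
        counts[idx] += n
    # A centroid that received no points keeps its previous position.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Repeat map + reduce until the centroids stop moving (hypothetical tol)."""
    for iteration in range(1, max_iter + 1):
        new = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(dist(a, b) for a, b in zip(centroids, new))
        centroids = new
        if shift <= tol:
            break
    return centroids, iteration


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    final, iterations = kmeans(data, [(0, 0), (10, 10)])
    print(final, iterations)
```

Each iteration is a full pass over the dataset, so in a distributed setting every iteration saved removes one complete map and reduce round.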


Design, synthesis and bioevaluation of novel 6-substituted aminoindazole derivatives as anticancer agents

Hoang, Luu, et al. 2020. RSC Adv.


A cell-MST-based method for big dataset clustering on limited memory computers

Hieu, Meesad. 2015


This paper presents a new clustering algorithm, called the Cell-MST-based method, that combines a cell-based method with Minimum Spanning Tree based (MST-based) methods. The algorithm is designed for big datasets on a limited-memory computer, especially thin big datasets, which have a small number of attributes but a very large number of instances. First, a cell-based method converts a big dataset into a small grid of cells in such a way that the memory required to store an edge-weighted graph created from the grid is less than the available memory of the computer. Then, MST-based methods obtain an optimal threshold, estimate the number of clusters, and determine the initial centroids. The proposed Cell-MST-based method can reduce the memory required by previous similarity-based and MST-based cluster number estimation methods by more than 99%. Moreover, it outperforms the quantization error modeling method in terms of execution time and estimation accuracy.
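The two stages described above can be sketched roughly as follows: points are quantized onto a grid, a minimum spanning tree is built over the centers of the non-empty cells, and tree edges longer than a threshold are cut to estimate the number of clusters. This is a minimal sketch under stated assumptions, not the authors' algorithm: the graph is assumed to be the complete Euclidean graph over cell centers, `cell_size` and `threshold` are hypothetical parameters supplied by the caller (the abstract says the method obtains an optimal threshold but does not say how), and the determination of initial centroids is omitted.

```python
from math import dist


def to_cells(points, cell_size):
    """Quantize points onto a grid and return the centers of non-empty cells."""
    cells = {tuple(int(v // cell_size) for v in p) for p in points}
    return [tuple((c + 0.5) * cell_size for c in cell) for cell in sorted(cells)]


def mst_edges(nodes):
    """Prim's algorithm on the complete Euclidean graph; returns (u, v, weight)."""
    n = len(nodes)
    if n < 2:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge weight into the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                w = dist(nodes[u], nodes[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges


def estimate_clusters(points, cell_size, threshold):
    """Cut MST edges longer than threshold; k cuts leave k + 1 components."""
    centers = to_cells(points, cell_size)
    return 1 + sum(1 for _, _, w in mst_edges(centers) if w > threshold)


if __name__ == "__main__":
    data = [(0.1, 0.1), (0.2, 0.3), (1.2, 0.4), (9.1, 9.2), (9.6, 9.4)]
    print(estimate_clusters(data, cell_size=1.0, threshold=2.0))
```

The memory saving comes from the first stage: the graph is built over cells rather than instances, so its size depends on the grid resolution and not on the number of rows in the dataset.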




Duong Van Hieu

MapReduce join strategies for key-value storage

Fast K-Means Clustering for Very Large Datasets Based on MapReduce Combined with a New Cutting Method

Design, synthesis and bioevaluation of novel 6-substituted aminoindazole derivatives as anticancer agents

A cell-MST-based method for big dataset clustering on limited memory computers
