2018
DOI: 10.1016/j.compeleceng.2018.04.023

A fast and effective partitional clustering algorithm for large categorical datasets using a k-means based approach

Cited by 57 publications (38 citation statements) | References 17 publications
“…Clustering methods can be classified into five types: hierarchical [2,3], partitional [4][5][6][7][8][9][10][11][12][13][14][15][16], density-based [17,18], grid-based [19] or model-based methods [20]. The aim of cluster analysis is to partition a dataset composed of N observations embedded in a d-dimensional space into k distinct clusters.…”
Section: Introduction
confidence: 99%
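To make the partitioning objective in the excerpt concrete, here is a minimal sketch that groups N observations in a d-dimensional space into k clusters using scikit-learn's KMeans; the dataset and the values of N, d, and k are illustrative assumptions, not taken from the cited papers.

import numpy as np
from sklearn.cluster import KMeans

N, d, k = 1000, 8, 5                       # illustrative sizes, not from the paper
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))                # synthetic d-dimensional observations

model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
labels = model.labels_                     # cluster index (0..k-1) for each observation
print(labels[:10], model.inertia_)         # inertia = within-cluster sum of squared distances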
“…Processing a transactional dataset greatly increases the time cost, and the quality of the outcomes is affected by the many iterations the process requires. Feature selection and dimensionality reduction techniques have been developed to address these issues; their main aim is to remove noisy, redundant and irrelevant information by preprocessing the data [11,12]. Recently, K-means and its variants have been widely used for clustering large datasets because of their high scalability and efficiency.…”
Section: Introduction
confidence: 99%
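As a hedged sketch of the preprocess-then-cluster idea in the excerpt, the following pairs PCA for dimensionality reduction with scikit-learn's MiniBatchKMeans, a scalable k-means variant; the dataset size, component count, and batch size are assumptions chosen only for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 50))                 # stand-in for a large, high-dimensional dataset

X_reduced = PCA(n_components=10).fit_transform(X)  # dimensionality reduction step
clusterer = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=1)
labels = clusterer.fit_predict(X_reduced)          # scalable k-means variant on the reduced data
print(np.bincount(labels))                         # resulting cluster sizes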
“…The k-means algorithm has advantages: it is easy to implement, it handles grouping of large datasets, and it offers stable performance across different problems (Ben Salem et al., 2018 [1]; Chakraborty and Das, 2018 [2]). However, the clustering results of k-means depend on a pre-specified number of clusters given as input.…”
Section: Introduction
confidence: 99%
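Since the excerpt notes that k-means requires the number of clusters as an input, the sketch below shows one common way to choose k by scanning candidate values with the silhouette score; the synthetic data and the candidate range are assumptions for illustration and are not drawn from the cited works.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# three well-separated synthetic groups, so a "good" k exists to be found
X = np.vstack([rng.normal(loc=c, size=(200, 4)) for c in (0.0, 5.0, 10.0)])

scores = {}
for k in range(2, 7):                      # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=2).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)       # highest silhouette -> chosen k
print(scores, "-> chosen k =", best_k)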