Clustering is a complex unsupervised method used to group most similar observations of a given dataset within the same cluster. To guarantee high efficiency, the clustering process should ensure high accuracy and low complexity. Many clustering methods were developed in various fields depending on the type of application and the data type considered. Categorical clustering considers segmenting a dataset in which the data are categorical and were widely used in many real-world applications. Thus several methods were developed including hard, fuzzy and rough set-based methods. In this survey, more than 30 categorical clustering algorithms were investigated. These methods were classified into hierarchical and partitional clustering methods and classified in terms of their accuracy, precision and recall to identify the most prominent ones. Experimental results show that rough set-based clustering methods provided better efficiency than hard and fuzzy methods. Besides, methods based on the initialization of the centroids also provided good results.
The categorical clustering problem has attracted much attention especially in the last decades since many real world applications produce categorical data. The k-mode algorithm, proposed since 1998, and its multiple variants were widely used in this context. However, they suffer from a great limitation related to the update of the modes in each iteration. The mode in the last step of these algorithms is randomly selected although it is possible to identify many candidate ones. In this paper, a rough density mode selection method is proposed to identify the adequate modes among a list of candidate ones in each iteration of the k-modes. The proposed method, called Density Rough k-Modes (DRk-M) was experimented using real world datasets extracted from the UCI Machine Learning Repository, the Global Terrorism Database (GTD) and a set of collected Tweets. The DRk-M was also compared to many states of the art clustering methods and has shown great efficiency.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.