Internal cluster validity indices are powerful tools for evaluating clustering performance. Designing such indices for categorical data is challenging because distances between categorical attribute values are difficult to measure. Existing approaches ignore both the relationships between different categorical attribute values and the detailed distribution information between data objects. To address these problems, we propose a novel index called Categorical data cluster Utility Based On Silhouette (CUBOS). Specifically, we first demonstrate the advantage of the Silhouette index paradigm in exploring the details of clustering results. We then propose the Improved Distance metric for Categorical data (IDC), inspired by Category Distance, to measure the distance between categorical data objects precisely. Finally, we combine the Silhouette index paradigm with IDC to construct CUBOS, which overcomes the aforementioned shortcomings and produces more accurate evaluation results than the baseline indices, as shown by experiments on several UCI datasets.
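To make the Silhouette paradigm concrete, the following is a minimal sketch of a Silhouette-style index computed over categorical objects. The abstract does not define IDC, so the sketch assumes a simple matching (Hamming) distance as a stand-in; it is not the CUBOS index itself. The function names `hamming_distance`, `silhouette_scores`, and `mean_silhouette` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def hamming_distance(x, y):
    """Fraction of attributes on which two categorical objects differ.

    Assumption: this is only a placeholder for IDC, which the abstract
    names but does not define.
    """
    return sum(a != b for a, b in zip(x, y)) / len(x)


def silhouette_scores(data, labels, dist=hamming_distance):
    """Return the per-object Silhouette value s = (b - a) / max(a, b)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("Silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            # Common convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the object's own cluster.
        a = sum(dist(data[i], data[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(data[i], data[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


def mean_silhouette(data, labels, dist=hamming_distance):
    """Average the per-object values into a single validity score."""
    scores = silhouette_scores(data, labels, dist)
    return sum(scores) / len(scores)


# Toy data: two attributes per object, two obvious groups.
data = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "large")]
good = mean_silhouette(data, [0, 0, 1, 1])  # partition matching the groups
bad = mean_silhouette(data, [0, 1, 0, 1])   # partition splitting the groups
```

Because the index is built from per-object distances, a partition that matches the two groups scores higher than one that splits them. Passing a different callable as `dist` would be the natural place to plug in a distance that accounts for relationships between attribute values.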
CABOSFV, the classic algorithm for clustering high-dimensional sparse data, cannot adjust sets once they have been generated, so the final clustering result depends on the sets formed earlier in the process. This paper proposes ADJ-CABOSFV, which can adjust the sets produced by CABOSFV so that the objects within each set are more similar, without increasing the number of parameters. Experiments on UCI data sets show that ADJ-CABOSFV retains its advantage on high-dimensional sparse data with binary variables and achieves better clustering quality than the classic CABOSFV.