The limitations of general clustering-evaluation methods will remain difficult to overcome as long as clustering validity is verified only from clustering results and evaluation index values. This study instead analyzes the validity of crisp clustering by examining the clustering process itself. First, we define the properties that a valid clustering process must satisfy and model clustering processes using program graphs and transition systems. We then recast the analysis of clustering validity as a model-checking problem: verifying whether the model of the clustering process satisfies the specified properties. In this way, we build a bridge between clustering and model checking. Experiments on several datasets indicate the effectiveness and suitability of our algorithms. Compared with traditional evaluation indices, our formal method not only indicates whether the clustering results are valid but, when they are invalid, also detects the objects that caused the invalidity.
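To make the idea of checking a property over a model of the clustering process concrete, the following is a minimal sketch and not the authors' program-graph model or property specification. It assumes a toy transition system whose states are sets of clusters and whose only transition merges two clusters, and it checks one illustrative safety property: every object belongs to exactly one cluster. The names `successors`, `offending_objects`, and `check_crisp` are hypothetical.

```python
from collections import deque
from itertools import combinations


def successors(state):
    """Toy transition relation (an assumption): merge any two clusters."""
    for a, b in combinations(state, 2):
        rest = [c for c in state if c != a and c != b]
        yield frozenset(rest + [a | b])


def offending_objects(state, objects):
    """Objects that are not in exactly one cluster of `state`."""
    counts = {o: 0 for o in objects}
    for cluster in state:
        for o in cluster:
            counts[o] = counts.get(o, 0) + 1
    return sorted(o for o, n in counts.items() if n != 1)


def check_crisp(initial, objects):
    """Explore all reachable states breadth-first.

    Returns None if every reachable state is a crisp partition,
    otherwise (violating_state, offending_objects).
    """
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        bad = offending_objects(state, objects)
        if bad:
            return state, bad
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None
```

Starting from three singleton clusters over objects 1, 2, 3, `check_crisp` returns `None`. Starting from the overlapping clusters {1, 2} and {2, 3}, it returns the violating state together with the offending object 2, which mirrors the abstract's point that a process-level check can name the objects responsible for an invalid result.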
Distance measurement is a core step in clustering analysis, and its results influence clustering accuracy. Existing measurement methods are mostly based on cluster feature information. However, such features may be insufficient and can lose data information for clusters that contain a large number of objects. To improve measurement accuracy, we make full use of the distribution characteristics of the objects in each cluster: we use descriptive statistics and the Wilcoxon-Mann-Whitney rank sum test from nonparametric statistics to measure distances during clustering. Furthermore, we propose a two-stage clustering algorithm to improve clustering performance. With the proposed distance measurement method, the algorithm does not require the number of clusters to be assumed in advance, can discover clusters of arbitrary shape, and improves clustering accuracy. Experiments on multiple datasets, with comparisons against other clustering algorithms, illustrate the accuracy and efficiency of the proposed algorithm.
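One way a rank-based test can act as a distance between clusters is sketched below. This is an illustration under stated assumptions, not the measure defined in the paper: it takes the values of a single feature in two clusters, computes the Mann-Whitney U statistic by pairwise comparison, and rescales it to a dissimilarity in [0, 1]. The function names and the rescaling are hypothetical choices.

```python
from itertools import product


def mann_whitney_u(a, b):
    """U statistic of sample `a` against sample `b`; ties count one half."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x, y in product(a, b)
    )


def rank_sum_distance(a, b):
    """Hypothetical dissimilarity derived from U (an assumption, not the paper's formula).

    0.0 means the two samples are evenly interleaved in rank order;
    1.0 means every value in one sample lies above every value in the other.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    u = mann_whitney_u(a, b)
    return abs(2.0 * u / (len(a) * len(b)) - 1.0)
```

For example, `rank_sum_distance([1, 2, 3], [10, 11, 12])` is 1.0 because the two samples do not overlap, while `rank_sum_distance([1, 2, 3], [1, 2, 3])` is 0.0. Because the measure depends only on the ordering of values, it uses every object in each cluster rather than a single summary such as the centroid.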
Distance measurement between uncertain data is an important basis for accurate clustering. Taking full advantage of an object's uncertainty characteristics helps to represent uncertain data more accurately and to calculate distances between such data. Using probability distribution functions to represent the characteristics of the uncertainty distribution, this paper studies a stochastic-simulation-based method for measuring the distance between uncertain objects. Experiments verify the effectiveness of the proposed method.
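A stochastic-simulation distance of this kind can be sketched as a Monte Carlo estimate of the expected Euclidean distance between two uncertain objects. The abstract does not give the paper's exact procedure, so the following is only an illustration under assumptions: each object is modeled as a sampler over its probability distribution, and the Gaussian model, the function names, and the default number of draws are all hypothetical.

```python
import math
import random


def gaussian_object(mean, sigma):
    """Hypothetical uncertain object: independent Gaussian noise around `mean`."""
    def draw(rng):
        return tuple(rng.gauss(m, sigma) for m in mean)
    return draw


def simulated_distance(draw_a, draw_b, n_draws=10_000, seed=0):
    """Monte Carlo estimate of the expected Euclidean distance.

    Each iteration draws one realization of each object and measures the
    distance between the two realizations; the estimate is the mean.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = 0.0
    for _ in range(n_draws):
        total += math.dist(draw_a(rng), draw_b(rng))
    return total / n_draws
```

With zero uncertainty the estimate reduces to the ordinary distance: two objects centered at (0, 0) and (3, 4) with `sigma=0.0` give 5.0. As `sigma` grows, the estimate reflects the spread of each object rather than only its central position.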