A Comparison of External Clustering Evaluation Indices in the Context of Imbalanced Data Sets

Souto, Michael Vandesteen Silva; Coelho, André L. V.; Faceli, Katti; Sakata, Tiemi C.; Bonadia, Viviane; Costa, Ivan G.

doi:10.1109/sbrn.2012.25

Cited by 33 publications

(21 citation statements)

References 10 publications


“…For each example, the first column shows the value obtained with the metric for the left partition, the second column shows the result for the right partition and the third column indicates if the formal constraint is satisfied ( ) or Note that none of current metrics can satisfy all constraints. Indeed, $F_{b^3}$ satisfies the first 4 F.C., but misses the correct identification of the best partition for the unbalanced case as reported by [4]. However, the proposed modifications $F^{mod\&0.9}_{b^3}$ (with $|x| = 3$) and $F^{0.9}_{b^3}$ manage to correctly classify all the formal constraints using the parameter $\alpha = 0.9$.…”

Section: Formal Constraints (mentioning)

confidence: 85%

“…Finally, Cluster size vs. quantity gives higher scores to partitions where few clusters are provided but separates most classes. In addition to these formal constraints, the Unbalanced constraint was recently added by [4] and evaluates if a misclassification is present in a big class or in a small one. This constraint gives better scores when the incorrect classified element is from the biggest class.…”

Section: Formal Constraints (mentioning)

confidence: 99%

“…This constraint gives better scores when the incorrect classified element is from the biggest class. Results using the examples proposed by [1] and [4] are shown in Table 1. For each example, the first column shows the value obtained with the metric for the left partition, the second column shows the result for the right partition and the third column indicates if the formal constraint is satisfied ( ) or Note that none of current metrics can satisfy all constraints.…”

Section: Formal Constraints (mentioning)

confidence: 99%

“…Each of these constraints evaluate a different situation that must be solved with a good evaluation metric. However, in the particular case of unbalanced datasets, these metrics fail to identify the correct solution [4]. The particularity of an unbalanced dataset is that one of the classes covers most of the document collection.…”

Section: Introduction (mentioning)

confidence: 99%


Adapted B-CUBED Metrics to Unbalanced Datasets

Moreno; Dias. 2015. Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval


B-CUBED metrics have recently been adopted in the evaluation of clustering results as well as in many other related tasks. However, this family of metrics is not well adapted when datasets are unbalanced. This issue is extremely frequent in Web results, where classes are distributed following a strong unbalanced pattern. In this paper, we present a modified version of B-CUBED metrics to overcome this situation. Results in toy and real datasets indicate that the proposed adaptation correctly considers the particularities of unbalanced cases.
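For orientation, the standard (unmodified) B-CUBED scores that this abstract refers to can be sketched as below. This is a minimal illustration, assuming hard, non-overlapping clusters and the usual item-wise definitions of precision and recall. The function name `bcubed`, the `alpha` weighting parameter, and the toy labels are hypothetical and are not taken from the cited paper; the adapted metrics it proposes are not implemented here.

```python
from collections import Counter


def bcubed(clusters, classes, alpha=0.5):
    """Standard BCubed (precision, recall, F) for hard partitions.

    clusters[i] is the predicted cluster of item i and classes[i] its
    gold class. alpha weights precision against recall in F
    (0.5 gives the plain harmonic mean). Illustrative sketch only.
    """
    if len(clusters) != len(classes) or not clusters:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(clusters)
    cluster_size = Counter(clusters)
    class_size = Counter(classes)
    # Number of items that share both a cluster and a class.
    pair_size = Counter(zip(clusters, classes))

    # Item-wise precision: share of the item's cluster with its class.
    precision = sum(
        pair_size[(c, g)] / cluster_size[c] for c, g in zip(clusters, classes)
    ) / n
    # Item-wise recall: share of the item's class found in its cluster.
    recall = sum(
        pair_size[(c, g)] / class_size[g] for c, g in zip(clusters, classes)
    ) / n
    f_score = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f_score


# Hypothetical unbalanced toy data: 8 items of class "a", 2 of class "b".
classes = ["a"] * 8 + ["b"] * 2
error_in_big_class = [0] * 7 + [1] + [1, 1]    # one "a" placed with the "b" items
error_in_small_class = [0] * 8 + [0] + [1]     # one "b" placed with the "a" items

f_big = bcubed(error_in_big_class, classes)[2]
f_small = bcubed(error_in_small_class, classes)[2]
print(f"F with error in big class:   {f_big:.4f}")
print(f"F with error in small class: {f_small:.4f}")
```

On this hypothetical toy set, the unmodified F score comes out higher when the misplaced item belongs to the small class, which is the kind of behaviour the citation statements above describe as a failure on unbalanced data.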


“…Hence, for our research purpose the external indices could be more robust in comparison of clustering concordance between sample and complete datasets. In their research aiming to carry out the effect of sampling, de Souto et al (2012) also preferred to use the external validity indices for assessing the partitions for highly imbalanced datasets. In our study, since we expect that the cluster densities can be changed by the sampling rates we also assumed that the external indices would be more informative in comparison of the partitions obtained on different sample datasets.…”

Section: External Validity Indices and Clustering Quality (mentioning)

confidence: 99%

Efficiency of Random Sampling Based Data Size Reduction on Computing Time and Validity of Clustering in Data Mining

Cebeci; Yildiz. 2016. JAI


ABSTRACT In data mining, cluster analysis is one of the widely used analytics to discover existing groups in datasets. However, the traditional clustering algorithms become insufficient for the analysis of big data which have been formed with the enormous increase in the amount of collected data in recent years. Therefore, the scalability has been one of the most intensively studied research topics for clustering big data. The parallel clustering algorithms and the Map-Reduce framework based techniques on multiple machines are getting popular in scalability for big data analysis. However, applying the sampling techniques on big datasets could be still alternative or complementary task in order to run the traditional algorithms on single machines. The results obtained in this study showed that the data size reduction by the simple random sampling could be successfully used in cluster analysis for large datasets. The clustering validities by running K-means algorithm on the sample datasets were found as high as those of the complete datasets. Additionally the required execution time for cluster analysis on the sample datasets was significantly shorter than those obtained for the complete datasets.
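One way to compare the partition obtained on a sample with the partition of the same items in the complete dataset is an external index such as the adjusted Rand index. The sketch below is only an illustration under assumptions: the abstract does not say which indices were used, and the function name `adjusted_rand_index` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two hard partitions of the same items.

    Identical partitions (up to relabelling) score 1.0, and the expected
    score of random labellings is about 0. Illustrative sketch only.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two label lists of equal length >= 2")
    n = len(labels_a)
    # Item pairs grouped together in both partitions, in A only, in B only.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = 0.5 * (pairs_a + pairs_b)
    if max_index == expected:
        # Degenerate case: both partitions are trivial (one cluster, or
        # all singletons), so they agree by convention.
        return 1.0
    return (pairs_both - expected) / (max_index - expected)


# Hypothetical labels for six sampled items: once from clustering the
# complete dataset, once from clustering the sample alone.
labels_from_complete = [0, 0, 1, 1, 2, 2]
labels_from_sample = [1, 1, 0, 0, 2, 2]
print(adjusted_rand_index(labels_from_complete, labels_from_sample))
```

Because the index is invariant to how clusters are numbered, the two labelings above count as the same partition even though their cluster identifiers differ.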


Min‐max kurtosis stratum mean: An improved K‐means cluster initialization approach for microarray gene clustering on multidimensional big data

Pandey; Shukla. 2022. Concurrency and Computation


SUMMARY Microarray gene clustering is a big data application that employs the K‐means (KM) clustering algorithm to identify hidden patterns, evolutionary relationships, unknown functions and gene trends for disease diagnosis, tissue detection and biological analysis. The selection of initial centroids is a major issue in the KM algorithm because it influences the effectiveness, efficiency and local optima of the cluster. The existing initial centroid initialization algorithm is computationally expensive and degrades cluster quality due to the large dimensionality and interconnectedness of microarray gene data. To deal with this issue, this study proposed the min‐max kurtosis stratum mean (MKSM) algorithm for big data clustering in a single machine environment. The MKSM algorithm uses kurtosis for dimension selection, mean distance for gene relationship identification, and stratification for heterogeneous centroid extraction. The results of the presented algorithm are compared to the state‐of‐the‐art initialization strategy on twelve microarray gene datasets utilizing internal, external and statistical assessment criteria. The experimental results demonstrate that the MKSMKM algorithm reduces iterations, distance computation, data comparison and local optima, and improves cluster performance, effectiveness and efficiency with stable convergence.

