2022
DOI: 10.1109/access.2021.3136435

Toolbox for Distance Estimation and Cluster Validation on Data With Missing Values

Abstract: Missing data are unavoidable in real-world applications of unsupervised machine learning, and their nonoptimal processing may decrease the quality of data-driven models. Imputation is a common remedy for missing values, but methods that directly estimate expected distances have also emerged. Because the treatment of missing values is rarely considered in clustering-related tasks, and distance metrics have a central role in both clustering and cluster validation, we developed a new toolbox that provides a wide range of algo…

Citations: cited by 12 publications (16 citation statements)
References: 57 publications
“…CTL with multi-estimation for missing data is based on belief function theory [139]. Niemela et al also developed a toolbox that offers a wide range of algorithms for data preprocessing, distance estimation, clustering, and cluster validation, specifically designed to handle missing values in data analysis [140].…”
Section: Citation Analysis | Citation type: mentioning
confidence: 99%
“…The proposed analysis process is composed of a novel combination of reliable unsupervised and supervised data mining and machine learning methods, which have been developed in the earlier research [30,33,4,34,31,54,23,21,55,45,22]. The overall process reads as follows:…”
Section: The Analysis Process | Citation type: mentioning
confidence: 99%
“…The essence behind Step 1 is the availability of a robust and reliable clustering method [33,4,21,22] which can tolerate over 30% of missing values [3]. In this unsupervised setting, the number of clusters must also be estimated, and for this purpose, a set of cluster validation indices, also applicable with missing values, was tested in [27,21,45].…”
Section: The Analysis Process | Citation type: mentioning
confidence: 99%
“…As depicted in Hämäläinen et al [7], Niemelä et al [13], the cluster validation indices are composed of a quotient of estimates of Inter and Intra of a clustering result, i.e., the variability of data within clusters divided by the separation of clusters. Both of these measures are computed with a distance measure which is inherited from the clustering problem formulation (Hämäläinen et al [8]).…”
Section: Introduction | Citation type: mentioning
confidence: 99%
“…Therefore, a key to reliable cluster validation indices with missing values is how to estimate the distances between the prototypes and the observations. For this purpose, in Niemelä et al [13], the classical partial distance strategy (Gower [6]) was applied with promising results. However, more recently a set of papers has appeared (Eirola et al [3,4], Mesquita et al [12]), which have addressed the distance estimation with missing values for both squared and Euclidean (nonsquared) distances with better accuracy than in Gower [6].…”
Section: Introduction | Citation type: mentioning
confidence: 99%
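
As a rough illustration of the partial distance strategy mentioned in the statement above, the following Python sketch uses one common formulation: the squared difference is summed over the coordinates observed in both vectors and then rescaled by the ratio of the total dimension to the number of jointly observed coordinates. The function name `partial_sq_distance` and the use of NaN to mark missing entries are assumptions for this example, and the sketch is not taken from the toolbox or the cited papers.

```python
import math


def partial_sq_distance(x, y):
    """Squared distance over jointly observed coordinates, rescaled.

    Missing entries are marked with NaN (an assumption of this sketch).
    Returns NaN when the two vectors share no observed coordinate.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Keep only coordinate pairs where both values are present.
    observed = [(a, b) for a, b in zip(x, y)
                if not (math.isnan(a) or math.isnan(b))]
    if not observed:
        return math.nan
    partial = sum((a - b) ** 2 for a, b in observed)
    # Rescale so the result is comparable to a full-dimensional distance.
    return len(x) / len(observed) * partial


# Second coordinate of the observation is missing.
observation = [1.0, math.nan, 3.0]
prototype = [0.0, 5.0, 1.0]
d2 = partial_sq_distance(observation, prototype)  # (3 / 2) * (1 + 4)
```

With no missing entries the scaling factor is 1 and the function reduces to the ordinary squared Euclidean distance.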