2018
DOI: 10.18637/jss.v083.i13

kamila: Clustering Mixed-Type Data in R and Hadoop

Abstract: In this paper we discuss the challenge of equitably combining continuous (quantitative) and categorical (qualitative) variables for the purpose of cluster analysis. Existing techniques require strong parametric assumptions, or difficult-to-specify tuning parameters. We describe the kamila package, which includes a weighted k-means approach to clustering mixed-type data, a method for estimating weights for mixed-type data (Modha-Spangler weighting), and an additional semiparametric method recently proposed in t…

Citations: cited by 58 publications (72 citation statements)
References: 40 publications
“…For most data sets with multiple nominal variables, this inevitably leads to small sample sizes within each categorical cell. Consider two typical mixed‐type data sets analysed in Foss & Markatou (): the first, a biomedical data set contains five nominal variables measured on 475 patients, while the second contains five nominal variables measured on about 80 million domestic airline flights in the USA. The distribution of counts within the combinatorial cells for each data set is shown in Figure ; in the biomedical data set, the median number of observations per cell is two, and even in the much larger airline data set, 25% of the cells have count less than 16.…”
Section: Statistical Mixture Models (mentioning)
confidence: 99%
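The sparsity argument in this statement can be checked with a short count. The sketch below is illustrative only: it uses synthetic data, and the choice of four levels per variable and uniform sampling are assumptions, since the statement gives only the number of nominal variables (five) and patients (475). It tabulates how many observations fall in each combination of levels.

```python
import random
from collections import Counter
from statistics import median

def cell_counts(rows):
    """Count observations in each observed combination of nominal levels."""
    return Counter(tuple(row) for row in rows)

# Hypothetical data: 475 observations on five nominal variables.
# Four levels per variable and uniform sampling are assumptions.
rng = random.Random(0)
n_obs, n_vars, n_levels = 475, 5, 4
rows = [[rng.randrange(n_levels) for _ in range(n_vars)] for _ in range(n_obs)]

counts = cell_counts(rows)
n_possible = n_levels ** n_vars          # 4^5 = 1024 possible cells
n_occupied = len(counts)                 # cells with at least one observation
median_count = median(counts.values())   # median over occupied cells only

print(f"possible cells: {n_possible}, occupied: {n_occupied}, "
      f"median count per occupied cell: {median_count}")
```

Under these assumptions there are more possible cells (1024) than observations (475), so at least 549 cells are empty no matter how the data fall.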
“…If an inadequate sample size is suspected, KAMILA incorporates a categorical smoother that can ameliorate these issues in most circumstances. The KAMILA method has been implemented in the R package kamila, as well as in Hadoop, with usage recommendations described in Foss & Markatou ().…”
Section: Statistical Mixture Models (mentioning)
confidence: 99%
“…Although Modha–Spangler clustering accounts for variable significance within the algorithm, it is vulnerable to individual noninformative variables, due to the fact that the single weight does not allow individual variables to be up‐ or downweighted (Foss, Markatou, Ray, & Heching, ). The Modha–Spangler algorithm is implemented in R package kamila (Foss & Markatou, ).…”
Section: Defining Dissimilarity Measures For Mixed Data (mentioning)
confidence: 99%
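The limitation described in this statement can be illustrated with a toy dissimilarity that uses one weight to trade off the continuous and categorical parts. This is a minimal sketch, not the kamila implementation: the function name `mixed_dissimilarity`, the use of simple matching for the categorical part, and the example values are all assumptions made for illustration.

```python
def mixed_dissimilarity(x_con, x_cat, c_con, c_cat, w):
    """Convex combination of a continuous and a categorical dissimilarity.

    w in [0, 1] is the single weight placed on the categorical part;
    (1 - w) is placed on the continuous part.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    # Squared Euclidean distance over the continuous variables.
    d_con = sum((a - b) ** 2 for a, b in zip(x_con, c_con))
    # Simple matching over the categorical variables (an assumed choice).
    d_cat = sum(a != b for a, b in zip(x_cat, c_cat))
    return (1.0 - w) * d_con + w * d_cat

# Hypothetical point and centroid: the second continuous variable is far
# from the centroid and dominates the continuous part of the distance.
point_con, point_cat = [0.0, 5.0], ["a", "x"]
centroid_con, centroid_cat = [0.1, 0.0], ["a", "y"]

for w in (0.0, 0.5, 1.0):
    print(w, mixed_dissimilarity(point_con, point_cat, centroid_con, centroid_cat, w))
```

Changing `w` rescales the whole continuous block at once. If the second continuous variable were noninformative, no value of `w` could reduce its influence relative to the first continuous variable, which is the vulnerability the statement attributes to a single weight.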
“…Clustering heterogenous dataset is a challenging process. The outcome of the analysis gives a significant impact on the interpretation of clusters [1,2,3,4]. Moreover, it demanded excessive computational skills and memory storage due to incorporation of broad categories [5].…”
Section: Introduction (mentioning)
confidence: 99%
“…The most common approached in treating heterogeneous data is through converting the variables into a single scale of measurement. However, this method may result in information loss [6,7,4]. Meanwhile, conducting a separate cluster analysis can abandon the connection between the variables which can be inappropriate.…”
Section: Introduction (mentioning)
confidence: 99%