Gaussian mixture model-based clustering is now a standard tool for estimating a hypothetical underlying partition of a single dataset. In this paper, we aim to cluster several different datasets at the same time, in a context where the underlying populations, even though different, are not completely unrelated: all individuals are described by the same features, and partitions of identical meaning are expected. Justifying from some natural arguments a stochastic linear link between the components of the mixtures associated with each dataset, we propose some parsimonious and meaningful models for a so-called simultaneous clustering method. Maximum likelihood mixture parameters, subject to the linear link constraint, can be easily estimated by a Generalized Expectation Maximization (GEM) algorithm that we describe. Some promising results are obtained in a biological context, where simultaneous clustering outperforms independent clustering for partitioning three different subspecies of birds. Further results on ornithological data show that the proposed strategy is robust to the relaxation of exact descriptor concordance, which is one of its main assumptions.

Keywords Stochastic linear link • Gaussian mixture • Model-based clustering • EM algorithm • Model selection • Biological features

1 Introduction

Clustering aims to separate a sample into classes in order to reveal some hidden but meaningful structure in data. In a probabilistic context it is standard practice to
Statisticians are already aware that any modelling process (exploration, prediction) is wholly dependent on the data unit, to the extent that it should be impossible to provide a statistical outcome without specifying the couple (unit,model). In this work, this general principle is formalized with a particular focus on model-based clustering and co-clustering in the case of possibly mixed data types (continuous and/or categorical and/or count features), which is also an opportunity to revisit what the related data units are. Such a formalization raises three important points: (i) the couple (unit,model) is not identifiable, so that different unit/model interpretations of the same whole modelling process are always possible; (ii) combining different "classical" units with different "classical" models should be an interesting opportunity for a cheap, wide and meaningful enlargement of the whole family of modelling processes defined by the couple (unit,model); (iii) if necessary, this couple, up to the non-identifiability property, could be selected by any traditional model selection criterion. Some experiments on real data sets illustrate in detail the practical benefits of these three points.
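Point (i) can be illustrated with a small numerical sketch. The example below is not taken from the paper: the log transformation, the toy sample `xs`, and all function names are assumptions chosen for illustration. It compares the couple (unit = log x, model = Gaussian), with the likelihood expressed on the original unit through the Jacobian term, against the couple (unit = x, model = log-normal), and computes a BIC value for each.

```python
import math


def gaussian_logpdf(y, mu, sigma):
    """Log-density of a univariate Gaussian N(mu, sigma^2) at y."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - (y - mu) ** 2 / (2.0 * sigma ** 2))


def loglik_gaussian_on_log_unit(xs, mu, sigma):
    """Couple (unit = log x, model = Gaussian).

    The likelihood is expressed on the original unit x, so each term
    includes the Jacobian |d log(x) / dx| = 1 / x.
    """
    return sum(gaussian_logpdf(math.log(x), mu, sigma) - math.log(x) for x in xs)


def loglik_lognormal_on_raw_unit(xs, mu, sigma):
    """Couple (unit = x, model = log-normal), using the explicit density."""
    total = 0.0
    for x in xs:
        density = (math.exp(-(math.log(x) - mu) ** 2 / (2.0 * sigma ** 2))
                   / (x * sigma * math.sqrt(2.0 * math.pi)))
        total += math.log(density)
    return total


def bic(loglik, n_params, n_obs):
    """BIC in the 'smaller is better' convention: -2 log L + k log n."""
    return -2.0 * loglik + n_params * math.log(n_obs)


# Hypothetical positive-valued sample (illustrative only, not real data).
xs = [0.5, 1.2, 2.0, 3.7, 8.1]

# Maximum likelihood estimates shared by both couples.
logs = [math.log(x) for x in xs]
mu_hat = sum(logs) / len(logs)
sigma_hat = math.sqrt(sum((v - mu_hat) ** 2 for v in logs) / len(logs))

ll_log_unit = loglik_gaussian_on_log_unit(xs, mu_hat, sigma_hat)
ll_lognormal = loglik_lognormal_on_raw_unit(xs, mu_hat, sigma_hat)

# Both couples have two free parameters (mu, sigma).
bic_log_unit = bic(ll_log_unit, 2, len(xs))
bic_lognormal = bic(ll_lognormal, 2, len(xs))

print(ll_log_unit, ll_lognormal)
print(bic_log_unit, bic_lognormal)
```

Under these assumptions the two log-likelihoods, and therefore the two BIC values, should agree up to floating-point rounding: the two couples describe the same modelling process, so a likelihood-based criterion cannot distinguish between them.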