In cluster analysis, it can be useful to interpret the partition built from the data in the light of external categorical variables that were not directly used to cluster the data. An approach is proposed in the model-based clustering context to select a model and a number of clusters that both fit the data well and take advantage of the potential illustrative ability of the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated with the external variables. It is noteworthy that each mixture model is fitted to the data by maximum likelihood, excluding the external variables, which are used only to select a relevant mixture model. Numerical experiments illustrate the promising behaviour of the derived criterion.
Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data from official statistics shows its usefulness.
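As a rough illustration of the starting point only, the sketch below runs plain expectation-maximization for a finite mixture of independent categorical (multinomial) features. It is not the method of the abstract: the latent feature-relevance variables and the minimum message length criterion are omitted, and the function name `fit_categorical_mixture`, the smoothing constant `alpha`, and the toy data are assumptions made for this example.

```python
import math
import random


def fit_categorical_mixture(data, n_clusters, n_levels, n_iter=50, seed=0, alpha=1e-2):
    """Plain EM for a mixture of independent categorical features.

    data:     list of rows, each a list of integer category codes.
    n_levels: number of categories of each feature.
    alpha:    small smoothing constant (an assumption of this sketch) that
              keeps every probability strictly positive.
    Returns (weights, theta, resp), where theta[k][j][c] estimates
    P(feature j = c | cluster k) and resp[i][k] is the posterior
    probability that row i belongs to cluster k.
    """
    rng = random.Random(seed)
    n, d = len(data), len(n_levels)

    # Random soft initialisation of the responsibilities.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])

    weights, theta = [], []
    for _ in range(n_iter):
        # M-step: update mixing weights and per-cluster category probabilities.
        nk = [sum(resp[i][k] for i in range(n)) for k in range(n_clusters)]
        weights = [(v + alpha) / (n + n_clusters * alpha) for v in nk]
        theta = []
        for k in range(n_clusters):
            theta_k = []
            for j in range(d):
                counts = [alpha] * n_levels[j]
                for i in range(n):
                    counts[data[i][j]] += resp[i][k]
                total = sum(counts)
                theta_k.append([c / total for c in counts])
            theta.append(theta_k)

        # E-step: recompute responsibilities in log space for stability.
        for i in range(n):
            logp = [
                math.log(weights[k])
                + sum(math.log(theta[k][j][data[i][j]]) for j in range(d))
                for k in range(n_clusters)
            ]
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp[i] = [v / total for v in unnorm]

    return weights, theta, resp


# Toy data (hypothetical): three binary features, two obvious groups.
toy = [[0, 0, 0]] * 10 + [[1, 1, 1]] * 10
weights, theta, resp = fit_categorical_mixture(toy, n_clusters=2, n_levels=[2, 2, 2])
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
```

In a feature-selection variant of the kind the abstract describes, each feature would additionally carry a relevance indicator, so that an irrelevant feature is modelled by one distribution shared across all clusters; that extension is deliberately left out here.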
This work is part of a supermarket chain expansion study and is intended to cluster the existing outlets in order to support the evaluation of outlet performance and the location of new outlet sites. To overcome the curse of dimensionality (a large number of attributes for a very small number of existing outlets), experts' knowledge is considered in the clustering process. Three alternative approaches are compared to this end, the experts being required to: (1) a priori, provide values for perceived dissimilarities between pairs of outlets; (2) a posteriori, evaluate results from alternative regression trees; (3) interactively, help to select base variables and evaluate results from alternative dendrograms. The latter approach provided the best results according to the marketing experts.
In this study, we address the discriminant factors of website trust. Specifically, we build sets of propositional rules that can be used to predict the level of trustworthiness of a site. Focusing on initial trust, a survey was designed to assess site characteristics observed by the respondent and his/her perceptions of appearance, reputation, fulfillment, and security. By exploring the data, we look for the most favorable rule-based classifiers among decision trees as well as classical and dominance-based rough sets. A heuristic aiming to derive simpler classifiers is also proposed. The experimental setup considers diverse groups of attributes (predictors) for the extraction of rules. The results obtained are compared by taking into account the predictive ability and parsimony of the rule sets. Finally, the selected sets help shed light on how consumers process site information and suggest specific recommendations for e-commerce vendors.