Frey and Dueck (Reports, 16 February 2007, p. 972) described an algorithm termed "affinity propagation" (AP) as a promising alternative to traditional data clustering procedures. We demonstrate that a well-established heuristic for the p-median problem often obtains clustering solutions with lower error than AP and produces these solutions in comparable computation time.

Frey and Dueck (1) described an algorithm for analyzing complex data sets termed "affinity propagation" (AP). The algorithm extracts a subset of representative objects or "exemplars" from the complete object set by exchanging real-valued messages between data points. Clusters are formed by assigning each data point to its most similar exemplar. The authors reported that "[a]ffinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time" (1). We demonstrate that an efficient implementation of a 40-year-old heuristic for the well-known p-median model (PMM) often provides lower-error solutions than AP in comparable central processing unit (CPU) time.

For consistency with AP in (1), we present the PMM as a sum-of-similarities maximization problem, while recognizing that this is equivalent to the more common form of minimizing the sum of dissimilarities (e.g., distances or costs). The PMM is a general mathematical problem that can be concisely stated as follows: Given an m × n similarity matrix, S, select p columns from S such that the sum of the maximum values within each row of the selected columns is maximized (2). Thus, each row is effectively assigned to its most similar selected column (exemplar) with the goal of maximizing overall similarity. One classic example of the PMM occurs in facility location planning: Locate p plants such that the total distance (or cost) required to serve m demand points is minimized.
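The column-selection objective stated above can be sketched in a few lines of Python. The function names (`pmm_objective`, `pmm_exhaustive`) are illustrative, and the exhaustive enumeration is practical only for tiny instances; it is included solely to make the objective concrete, not as a usable solver:

```python
from itertools import combinations

import numpy as np


def pmm_objective(S, cols):
    """Sum over rows of the maximum similarity among the selected columns."""
    return S[:, cols].max(axis=1).sum()


def pmm_exhaustive(S, p):
    """Exact PMM for tiny instances: enumerate every p-subset of columns."""
    best_cols, best_val = None, -np.inf
    for cols in combinations(range(S.shape[1]), p):
        val = pmm_objective(S, list(cols))
        if val > best_val:
            best_cols, best_val = list(cols), val
    return best_cols, best_val
```

For example, with S built as negative squared Euclidean distances between the points 0, 1, 10, and 11, any optimal choice of p = 2 exemplars leaves a total error of 2, i.e., an objective value of -2.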
In data analysis applications where S is an n × n matrix of negative squared Euclidean distances between objects, clustering the n objects using the PMM corresponds to the selection of p exemplars to minimize error, which is defined as the sum of the squared Euclidean distances of each object to its nearest exemplar.

Lagrangian relaxation methods enable the exact solution of PMM instances with n ≤ 500 objects (3, 4). For larger problems, a vertex substitution heuristic (VSH) developed in (5) has been the standard for comparison for nearly four decades. The VSH begins with the random selection of a subset of p exemplars, which is iteratively refined by evaluating the effects of substituting an unselected point for one of the selected exemplars. Frey and Dueck assert that this type of strategy "works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution" (1). To the contrary, the VSH is remarkably effective and is often the engine for metaheuristics such as tabu search (6) and variable neighborhood search (7).

We compared AP to an efficient implementation of VSH (7) across eight data sets...
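The swap-based refinement just described can be sketched as follows. This is a minimal illustration of the Teitz-Bart-style substitution step (random start, accept any improving exchange, repeat until no exchange improves the objective), not the efficient implementation evaluated in the comment:

```python
import random

import numpy as np


def vsh(S, p, seed=0):
    """Vertex substitution heuristic sketch for the PMM.

    S: n x n similarity matrix (e.g., negative squared Euclidean distances).
    Returns a locally optimal set of p exemplars and its objective value.
    """
    rng = random.Random(seed)
    n = S.shape[0]
    exemplars = rng.sample(range(n), p)
    obj = S[:, exemplars].max(axis=1).sum()
    improved = True
    while improved:
        improved = False
        for cand in range(n):
            if cand in exemplars:
                continue
            # Try substituting the candidate for each current exemplar.
            for i in range(p):
                trial = exemplars[:i] + [cand] + exemplars[i + 1:]
                trial_obj = S[:, trial].max(axis=1).sum()
                if trial_obj > obj:
                    exemplars, obj = trial, trial_obj
                    improved = True
                    break
    return sorted(exemplars), obj
```

Each pass evaluates O(np) candidate swaps; the loop terminates because the objective strictly increases with every accepted substitution and can take only finitely many values.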
The p-median clustering model represents a combinatorial approach to partitioning data sets into disjoint, non-hierarchical groups. Classes are constructed around exemplars, which are manifest objects in the data set, with the remaining objects assigned to their closest cluster centers. Effective, state-of-the-art implementations of p-median clustering are virtually unavailable in the popular social and behavioral science statistical software packages. We present p-median clustering, including a detailed description of its mechanics and a discussion of available software programs and their capabilities. An application to a complex structured data set on the perception of food items illustrates p-median clustering.
The monotone homogeneity model (MHM, also known as the unidimensional monotone latent variable model) is a nonparametric IRT formulation that provides the underpinning for partitioning a collection of dichotomous items to form scales. Ellis (Psychometrika 79:303-316, 2014, doi:10.1007/s11336-013-9341-5) has recently derived inequalities that are implied by the MHM, yet require only the bivariate (inter-item) correlations. In this paper, we incorporate these inequalities within a mathematical programming formulation for partitioning a set of dichotomous scale items. The objective criterion of the partitioning model is to produce clusters of maximum cardinality. The formulation is a binary integer linear program that can be solved exactly using commercial mathematical programming software. However, we have also developed a standalone branch-and-bound algorithm that produces globally optimal solutions. Simulation results and a numerical example are provided to demonstrate the proposed method.
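The abstract does not reproduce the Ellis inequalities or the full partitioning formulation, so the following is only a generic branch-and-bound sketch for a maximum-cardinality selection problem. The hypothetical `compatible` predicate stands in for the bivariate inequality checks, and the routine finds a single largest feasible item subset rather than the paper's full partition:

```python
def max_feasible_subset(items, compatible):
    """Branch-and-bound sketch: largest subset of `items` in which every
    pair satisfies `compatible` (a stand-in for the inequality checks).
    """
    best = []

    def recurse(chosen, remaining):
        nonlocal best
        # Bound: prune if even taking every remaining item cannot beat best.
        if len(chosen) + len(remaining) <= len(best):
            return
        if not remaining:
            best = chosen[:]  # strictly larger, by the bound above
            return
        head, rest = remaining[0], remaining[1:]
        # Branch 1: include head if it is pairwise-feasible with choices so far.
        if all(compatible(head, c) for c in chosen):
            recurse(chosen + [head], rest)
        # Branch 2: exclude head.
        recurse(chosen, rest)

    recurse([], list(items))
    return best
```

The cardinality bound mirrors the pruning logic of a standalone branch-and-bound solver: a partial solution is abandoned as soon as it provably cannot exceed the incumbent.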