Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such "exemplars" can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called "affinity propagation," which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.

Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use the data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small. When the centers are selected from actual data points, they are called "exemplars." The popular k-centers clustering technique (1) begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors. k-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However, this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution. We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars.
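The k-centers procedure just described alternates between assigning each point to its nearest exemplar and re-electing, within each cluster, the member that minimizes the cluster's error. A minimal illustrative sketch follows; the restart count, iteration cap, and precomputed squared-distance matrix `D` are assumptions for this example, not the setup used in the experiments reported here.

```python
import numpy as np

def k_centers(D, k, n_restarts=20, seed=0):
    """Exemplar-based k-centers clustering on a squared-distance matrix D.

    Alternates assignment and exemplar re-election until the exemplar set
    stabilizes; keeps the best result over several random initializations.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    best_cost, best_exemplars = np.inf, None
    for _ in range(n_restarts):
        exemplars = rng.choice(n, size=k, replace=False)
        for _ in range(100):
            # Assignment step: each point joins its nearest exemplar.
            labels = exemplars[np.argmin(D[:, exemplars], axis=1)]
            new = []
            for e in exemplars:
                members = np.flatnonzero(labels == e)
                # Re-election step: pick the member that minimizes
                # the within-cluster sum of squared errors.
                sub = D[np.ix_(members, members)]
                new.append(members[np.argmin(sub.sum(axis=0))])
            new = np.array(new)
            if set(new) == set(exemplars):
                break
            exemplars = new
        cost = D[np.arange(n), exemplars[np.argmin(D[:, exemplars], axis=1)]].sum()
        if cost < best_cost:
            best_cost, best_exemplars = cost, np.sort(exemplars)
    return best_exemplars, best_cost
```

Because every step consults only the pairwise distance matrix, the same routine applies when only similarities, not feature vectors, are available; the sensitivity to initialization discussed above is exactly why `n_restarts` must typically be large.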
By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges. As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method "affinity propagation." Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.
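The simple update formulas referred to above are the "responsibility" and "availability" messages of affinity propagation. A hedged NumPy sketch of the message-passing loop is given below; the damping factor and iteration count are illustrative choices, and the diagonal of the similarity matrix `S` is assumed to hold the per-point "preferences" that control how many exemplars emerge.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iters=200):
    """Minimal affinity propagation on a similarity matrix S.

    S[i, k] is the similarity of point i to candidate exemplar k; the
    diagonal S[k, k] holds the preferences. Returns, for each point,
    the index of its chosen exemplar.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities: evidence that k should serve i
    A = np.zeros((n, n))  # availabilities: evidence that i should choose k
    for _ in range(iters):
        # Responsibility: r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * R_new
        # Availability: a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())   # keep r(k,k) itself, unclipped
        A_new = Rp.sum(axis=0)[None, :] - Rp
        dA = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, dA)          # a(k,k) = sum_{i' != k} max(0, r(i',k))
        A = damping * A + (1 - damping) * A_new
    return np.argmax(A + R, axis=1)
```

Each point i is then assigned to the k that maximizes a(i,k) + r(i,k); when that k equals i, point i is itself an exemplar. The damped updates avoid the numerical oscillations that undamped message passing can exhibit.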
We present experimental results demonstrating that our method can better recover functionally relevant clusterings in mRNA expression data than standard clustering techniques, including hierarchical agglomerative clustering, and we show that by computing probabilities instead of point estimates, our method avoids converging to poor solutions.
Affinity propagation (AP) can be viewed as a generalization of the vertex substitution heuristic (VSH), whereby probabilistic exemplar substitutions are performed concurrently. Although results on small data sets (≤900 points) demonstrate that VSH is competitive with AP, we found VSH to be prohibitively slow for moderate-to-large problems, whereas AP was much faster and could achieve lower error.

Affinity propagation (AP) is an algorithm that clusters data and identifies exemplar data points that can be used for summarization and subsequent analysis (1). Dozens of clustering algorithms have been invented in the past 40 years, but in (1) we compared AP with three commonly used methods and found that AP could find solutions with lower error and do so much more quickly. Brusco and Köhn (2) compared AP with the best of 20 runs of a randomly initialized vertex substitution heuristic (VSH) described in 1997 (3), which is based on a previously introduced method (4). They found that for some small data sets (≤900 data points), VSH achieves lower error than AP in a similar amount of time. We subsequently confirmed those results but found no factual errors in our original report. Interestingly, when we studied larger, more complex data sets, we found that AP can achieve lower error than VSH in a fraction of the time (5). VSH took ∼10 days to find 454 clusters in 17,770 Netflix movies, whereas AP took ∼2 hours and achieved lower error.

As explained in our original report (1), regardless of whether the measure of data similarity is symmetric, "[e]xactly minimizing [AP's cost function] is computationally intractable, because a special case of this minimization problem is the NP-hard k-median problem" (also known as the p-median model, or PMM) [also see (6)]. Consequently, it is expected that different algorithms may work better for different data sets.
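For concreteness, the vertex substitution heuristic referenced above can be sketched as a first-improvement swap search over exemplar sets. This is a simplified illustration of the idea; the random initialization, tie handling, and acceptance rule here are assumptions rather than the exact procedure of (3).

```python
import numpy as np

def vertex_substitution(D, k, seed=0):
    """Greedy vertex substitution (swap) heuristic for the p-median problem.

    Starting from random exemplars, repeatedly try replacing one exemplar
    with one non-exemplar point, accepting any swap that strictly lowers
    the total error, until no improving swap exists (a local optimum).
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    exemplars = list(rng.choice(n, size=k, replace=False))

    def cost(ex):
        # p-median objective: each point pays its distance to the
        # nearest exemplar in the candidate set.
        return D[:, ex].min(axis=1).sum()

    current = cost(exemplars)
    improved = True
    while improved:
        improved = False
        for out in list(exemplars):
            rest = [e for e in exemplars if e != out]
            for cand in range(n):
                if cand in exemplars:
                    continue
                c = cost(rest + [cand])
                if c < current - 1e-12:   # accept only strict improvements
                    exemplars = rest + [cand]
                    current = c
                    improved = True
                    break
            if improved:
                break
    return sorted(exemplars), current
```

Each accepted swap strictly lowers the cost, so the search terminates at a swap-local optimum; rerunning from several random initializations, as Brusco and Köhn did, keeps the best of those local optima. The exhaustive swap scan is what makes VSH slow on large problems, since every sweep examines on the order of k(n − k) candidate substitutions.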
In fact, we reported results using a brute force method that can exactly solve trivially small problems (e.g., 70 points and six clusters) in a minute or two on any modern computer. We also pointed out that linear programming relaxations (i.e., Lagrangian relaxations) have been used when the data set is not too large (7).

We were curious about how close AP and VSH could get to the best possible (exact) solution, so we studied the original Olivetti face data set (n = 400 data points), for which the exact clustering solution could be found. Because the error for VSH varies depending on the random initialization, in all of our experiments we used the best of 20 runs. Figure 1A plots squared error versus the number of exemplars (k) for AP (single run), VSH (best of 20 runs), k-centers clustering (all of one million runs), and the exact solution (7). Both AP and VSH performed substantially better than k-centers clustering and, for practical purposes, achieved the exact solution (for k < 135, the difference in error between AP or VSH and the exact solution is less than the error reduction obtained by including an additional exemplar) (5).

The results presented by Brusco...