This work presents an efficient implementation of affinity propagation (AP) on clusters of graphics processing units (GPUs). AP is a state-of-the-art method for finding exemplars in data sets described by similarity matrices, and it is typically employed in crisp clustering applications. When finding exemplars in an n-pattern data set with dense, non-metric similarities, AP iteratively processes three n × n floating-point matrices: one stores the similarities, and the other two store the values that ultimately pinpoint the exemplars. For large similarity matrices, AP is therefore expensive in both computation and memory. Although the matrix operations of AP are well suited to GPUs, its memory footprint limits the size of the problems that can be solved on a single unit. We therefore present a decomposition scheme for AP that distributes the calculations over multiple GPUs with a low communication-to-computation ratio. Because of this favorable communication pattern, our implementation finds exemplars in large, dense similarity data efficiently, even when the GPUs are connected by a slow network. Furthermore, by combining the global device memory of multiple GPUs, it can solve problems that would not fit in a single unit.
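To make the three n × n matrices concrete, the following is a minimal single-process sketch of the standard AP message-passing updates in NumPy. It is not the multi-GPU decomposition described above; the function name `affinity_propagation`, the damping factor, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, iterations=200):
    """Standard dense AP updates (illustrative sketch, CPU only).

    S[i, k] is the similarity of pattern i to candidate exemplar k;
    the diagonal S[k, k] holds the preferences. Returns, for each
    pattern, the index of the exemplar it selects.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # second n x n matrix: responsibilities
    A = np.zeros((n, n))  # third n x n matrix: availabilities
    rows = np.arange(n)
    for _ in range(iterations):
        # r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, idx] = S[rows, idx] - second
        R = damping * R + (1.0 - damping) * R_new

        # a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
        # a(k,k) = sum over i' != k of max(0, r(i',k))
        Rp = np.maximum(R, 0.0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0.0)
        A_new[rows, rows] = diag
        A = damping * A + (1.0 - damping) * A_new
    # Each pattern picks the k that maximizes a(i,k) + r(i,k).
    return np.argmax(A + R, axis=1)

# Usage: negative squared distances as similarities (an assumed choice),
# with a common preference on the diagonal.
x = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
S = -(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, -50.0)
labels = affinity_propagation(S)
```

Every iteration reads and writes all of `S`, `R`, and `A`, so the work and the storage both grow as n², which is the cost that motivates spreading the matrices across several devices.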