We propose a set of highly scalable algorithms for the combinatorial data analysis problem of seriating similarity matrices. Seriation consists of finding a permutation of data instances such that similar instances are nearby in the ordering. Applications of the seriation problem can be found in various disciplines, such as bioinformatics (for genome sequencing), data visualization, and exploratory data analysis. Our algorithms attempt to minimize certain p-SUM objectives, which also arise in the problem of envelope reduction of sparse matrices. In particular, we present a set of graduated non-convexity algorithms for vector-based relaxations of the general p-SUM problem for p ∈ {2, 1, 1/2} that can scale to very large problem sizes. Different choices of p emphasize global versus local similarity pattern structure. We conduct a number of experiments to compare our algorithms to various state-of-the-art combinatorial optimization methods on real and synthetic datasets. The experimental results demonstrate that, compared to other approaches, the proposed algorithms are very competitive and scale well with large problem sizes.
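As a rough illustration of what a p-SUM objective measures, the sketch below evaluates the cost of a given ordering on a toy similarity matrix. It assumes the commonly used form, the sum over all pairs of A[i][j] * |pos_i - pos_j|^p, which the abstract does not spell out; the function name `p_sum` and the toy data are hypothetical, and this only evaluates the objective rather than implementing the graduated non-convexity algorithms themselves.

```python
def p_sum(A, perm, p):
    """Assumed p-SUM cost: sum over i, j of A[i][j] * |pos_i - pos_j| ** p.

    A    : square similarity matrix as a list of lists
    perm : ordering of the items, perm[rank] = item placed at that rank
    p    : exponent, e.g. 2, 1, or 0.5
    """
    n = len(A)
    pos = [0] * n
    for rank, item in enumerate(perm):
        pos[item] = rank  # position of each item in the linear arrangement
    return sum(
        A[i][j] * abs(pos[i] - pos[j]) ** p
        for i in range(n)
        for j in range(n)
    )


# Toy data: items on a chain, each similar only to its direct neighbours.
n = 5
A = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

ordered = [0, 1, 2, 3, 4]    # the underlying chain order
shuffled = [2, 0, 4, 1, 3]   # an arbitrary scrambled order

for p in (2, 1, 0.5):
    print(p, p_sum(A, ordered, p), p_sum(A, shuffled, p))
```

For every exponent the chain order has the lower cost, and the gap between the two orderings is widest for p = 2, which penalizes similar pairs placed far apart most heavily.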
In this work we propose a highly scalable algorithm for solving the combinatorial data analysis problem of seriation. Seriation is a technique for optimizing a permutation of data instances with respect to some proximity measure, such that instances that are nearby in the linear arrangement are more similar. One consistent objective function for seriation is the 2-SUM minimization problem, which uses the 2-norm between instance locations to penalize non-zero similarity values and can be written as a quadratic function of the permutation vector. Recently, two convex relaxations of the 2-SUM problem have been proposed, which can be solved as constrained quadratic programs using interior point methods; however, the interior point solvers become expensive as the problem size increases. In this paper we present a graduated non-convexity method for vector-based relaxations of the 2-SUM problem that yields better approximate solutions and scales to very large problem sizes. We conduct a number of experiments on real and synthetic datasets. The experimental results demonstrate that our proposed algorithm outperforms other approaches that solve the 2-SUM problem, and is the only competitive approach that can scale to large problem sizes.
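The remark that the 2-SUM objective is a quadratic function of the permutation vector can be checked numerically. The sketch below assumes a symmetric similarity matrix A with zero diagonal and uses the standard graph Laplacian L = D - A, for which the sum over i, j of A[i][j] * (x_i - x_j)^2 equals 2 * x^T L x. The abstract does not give the exact form used in the paper, so the helper names and the toy matrix here are illustrative only.

```python
def laplacian(A):
    """Graph Laplacian L = D - A, assuming A is symmetric with zero diagonal."""
    n = len(A)
    return [
        [(sum(A[i]) if i == j else 0) - A[i][j] for j in range(n)]
        for i in range(n)
    ]


def two_sum(A, x):
    """Pairwise form: sum over i, j of A[i][j] * (x[i] - x[j]) ** 2."""
    n = len(A)
    return sum(A[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))


def quad_form(L, x):
    """Quadratic form x^T L x."""
    n = len(L)
    return sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))


# Toy symmetric similarity matrix (hypothetical data).
A = [
    [0, 3, 1, 0],
    [3, 0, 2, 1],
    [1, 2, 0, 3],
    [0, 1, 3, 0],
]
pi = [2, 0, 3, 1]  # permutation vector: pi[i] is the position of item i

L = laplacian(A)
print(two_sum(A, pi), 2 * quad_form(L, pi))  # the two values agree
```

Writing the objective as a quadratic form is what allows the permutation constraint to be relaxed to a continuous vector, which is the setting the relaxations mentioned above operate in.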
We consider the problem of recovering a circular arrangement of data instances with respect to some proximity measure, such that nearby instances are more similar. Applications of this problem, also referred to as circular seriation, can be found in various disciplines such as genome sequencing, data visualization, and exploratory data analysis. Circular seriation can be expressed as a quadratic assignment problem, which is intractable in general. Spectral-based approaches can be used to find approximate solutions, but have been shown to perform well only for a specific class of data matrices. We propose a bilevel optimization framework in which we employ a spherical embedding approach together with a spectral method for circular ordering in order to recover circular arrangements of the embedded data. Experiments on real and synthetic datasets demonstrate the competitive performance of the proposed method.