Given a set of points labeled with k labels, we introduce the heat map sorting problem: reorder and merge the points and dimensions while preserving the clusters (labels). A cluster is preserved if it remains connected, i.e., if it is not split into several clusters and no two clusters are merged. We prove that the problem is NP-hard and give a fixed-parameter algorithm that runs in a constant number of rounds in the massively parallel computation model, where each machine has sublinear memory and the total memory of the machines is linear. We also give an approximation algorithm for an NP-hard special case of the problem. We empirically compare our algorithm with k-means and density-based clustering (DBSCAN), using dimensionality reduction via locality-sensitive hashing, on several directed and undirected graphs of email and computer networks.
The k-center problem is to choose a subset of size k from a set of n points such that the maximum distance from any point to its nearest center is minimized. Let Q = {Q1, ..., Qn} be a set of polygons or segments in the region-based uncertainty model, in which each Qi is an uncertain point, where the exact locations of the points in Qi are unknown. Geometric objects such as segments and polygons can serve as models of a point set. We define the uncertain version of the k-center problem as a generalization in which the objective is to find k points from Q that cover the remaining regions of Q with the minimum or the maximum cluster radius, so as to cover at least one or all exact instances of each Qi, respectively. We modify the region-based model to allow multiple points to be chosen from a region, and call the resulting model the aggregated uncertainty model. All these problems contain the point version as a special case, so they are all NP-hard, with a lower bound of 1.822 on the approximation factor. We give approximation algorithms for the uncertain k-center of a set of segments and polygons. We have also implemented some of our algorithms and run them on a data set to show that our theoretical performance guarantees can be achieved in practice.
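For orientation, the point version of k-center that the abstract generalizes can be sketched with the classical farthest-first traversal (Gonzalez), a well-known 2-approximation. This is not the paper's algorithm for segments or polygons; the function name `greedy_k_center`, the choice of Euclidean distance, and the choice of the first point as the initial center are assumptions made only for illustration.

```python
import math


def greedy_k_center(points, k):
    """Farthest-first traversal for the point version of k-center.

    Illustrative sketch only (hypothetical helper, Euclidean distance assumed).
    Returns the indices of the chosen centers and the covering radius.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    centers = [0]  # arbitrary first center: the first input point
    # dist[i] = distance from points[i] to its nearest chosen center
    dist = [math.dist(p, points[0]) for p in points]

    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(far)
        dist = [min(d, math.dist(p, points[far])) for d, p in zip(dist, points)]

    # The objective value: maximum distance from any point to its nearest center.
    return centers, max(dist)
```

On two well-separated pairs of points on a line, for example `[(0, 0), (1, 0), (10, 0), (11, 0)]` with `k = 2`, the sketch picks one center from each pair, so the radius is the within-pair distance.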
Given a set I of k strings, their longest common subsequence (LCS) is the longest string that is a subsequence of every string in I. A data structure for this problem preprocesses I so that the LCS of a set of query strings Q with the strings of I can be computed faster. Since the problem is NP-hard for arbitrary k, we allow an error in which some characters may be replaced by other characters. We define the approximation version of the problem with an extra input m, the length of the regular expression (regex) that describes the input; the approximation factor is the logarithm of the number of possibilities in the regex returned by the algorithm, divided by the logarithm of the number of possibilities in the regex with the minimum number of possibilities. Then, we use a tree data structure to achieve sublinear-time LCS queries.
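As a baseline for the definition above, the LCS of just two strings can be computed exactly with the textbook dynamic program in O(|a| * |b|) time. The sketch below shows only that two-string case; it is not the regex-based approximation or the tree data structure described in the abstract, and the function name `lcs` is a hypothetical helper.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b.

    Textbook dynamic program, shown for illustration only.
    """
    m, n = len(a), len(b)
    # table[i][j] = LCS length of the prefixes a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover one optimal subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

Several subsequences can share the maximum length, so only the length of the result is determined by the input; which one is returned depends on the tie-breaking in the backtracking step.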