We analyze LSH Forest [BCG05], a popular heuristic for the nearest neighbor search, and show that a careful yet simple modification of it outperforms "vanilla" LSH algorithms. The end result is the first instance of a simple, practical algorithm that provably leverages data-dependent hashing to improve upon data-oblivious LSH. Here is the entire algorithm for the d-dimensional Hamming space. The LSH Forest, for a given dataset, applies a random permutation to all the d coordinates, and builds a trie on the resulting strings. In our modification, we further augment this trie: for each node, we store a constant number of points close to the mean of the corresponding subset of the dataset, which are compared to any query point reaching that node. The overall data structure is simply several such tries sampled independently. While the new algorithm does not quantitatively improve upon the best data-dependent hashing algorithms from [AR15] (which are known to be optimal), it is significantly simpler, being based on a practical heuristic, and is provably better than the best LSH algorithm for the Hamming space [IM98, HIM12].
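The construction described above can be sketched in a few lines of Python. This is an illustrative reading of the abstract, not the authors' implementation: the names `build_trie`, `query_trie`, `build_forest`, and `query_forest`, the parameters `num_pivots`, `leaf_size`, and `num_trees`, the leaf-size stopping rule, and the choice of measuring closeness to the mean by coordinate-wise absolute difference are all assumptions made for the sketch.

```python
import random

def hamming(x, y):
    """Hamming distance between two equal-length 0/1 tuples."""
    return sum(a != b for a, b in zip(x, y))

def build_trie(points, perm, depth=0, num_pivots=2, leaf_size=4):
    """Build a trie over the points, branching on coordinates in the order given by perm."""
    d, n = len(perm), len(points)
    # Coordinate-wise mean of the subset of points that reached this node.
    mean = [sum(p[i] for p in points) / n for i in range(d)]
    def dist_to_mean(p):
        return sum(abs(p[i] - mean[i]) for i in range(d))
    # Augmentation: keep a constant number of points closest to the mean (assumed metric).
    node = {"pivots": sorted(points, key=dist_to_mean)[:num_pivots],
            "children": {}, "points": None}
    if n <= leaf_size or depth == d:
        node["points"] = points  # leaf: store the whole bucket
        return node
    coord = perm[depth]
    for bit in (0, 1):
        subset = [p for p in points if p[coord] == bit]
        if subset:
            node["children"][bit] = build_trie(subset, perm, depth + 1,
                                               num_pivots, leaf_size)
    return node

def query_trie(node, perm, q):
    """Descend along q's permuted bits, comparing q to the points stored on the way."""
    best, depth = None, 0
    while node is not None:
        is_leaf = node["points"] is not None
        for p in (node["points"] if is_leaf else node["pivots"]):
            if best is None or hamming(p, q) < hamming(best, q):
                best = p
        if is_leaf:
            break
        node = node["children"].get(q[perm[depth]])
        depth += 1
    return best

def build_forest(points, num_trees=8, seed=0, **kwargs):
    """Several tries, each built with an independent random coordinate permutation."""
    rng = random.Random(seed)
    d = len(points[0])
    forest = []
    for _ in range(num_trees):
        perm = list(range(d))
        rng.shuffle(perm)
        forest.append((perm, build_trie(points, perm, **kwargs)))
    return forest

def query_forest(forest, q):
    """Return the closest candidate found across all tries."""
    candidates = [query_trie(root, perm, q) for perm, root in forest]
    return min(candidates, key=lambda p: hamming(p, q))
```

A typical use would build the forest once with `build_forest(data)` and then call `query_forest(forest, q)` per query. The sketch makes no attempt to reproduce the parameter settings or guarantees analyzed in the paper.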
We present an algorithm for approximating the edit distance between two strings of length $n$ in time $n^{1+\varepsilon}$, for any $\varepsilon > 0$, up to a constant factor. Our result completes the research direction set forth in the recent breakthrough paper [CDG+18], which showed the first constant-factor approximation algorithm with a (strongly) sub-quadratic running time. Several recent results have shown near-linear complexity under different restrictions on the inputs (e.g., when the edit distance is close to maximal, or when one of the inputs is pseudo-random). In contrast, our algorithm obtains a constant-factor approximation in near-linear running time for any input strings.
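For context on the quantity being approximated, the following is a minimal sketch of the textbook dynamic program that computes edit distance exactly in quadratic time. It is the standard baseline, not the near-linear algorithm of the paper; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s, t):
    """Exact edit distance (insertions, deletions, substitutions) in O(len(s) * len(t)) time."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a from s
                            curr[j - 1] + 1,          # insert b into s
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # prints 3
```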
We study the problem of discrete distribution testing in the two-party setting. For example, in the standard closeness testing problem, Alice and Bob each have $t$ samples from, respectively, distributions $a$ and $b$ over $[n]$, and they need to test whether $a = b$ or $a, b$ are $\varepsilon$-far (in the $\ell_1$ distance) for some fixed $\varepsilon > 0$. This is in contrast to the well-studied one-party case, where the tester has unrestricted access to samples of both distributions, for which optimal bounds are known for a number of variations. Despite being a natural constraint in applications, the two-party setting has evaded attention so far. We address two fundamental aspects of the two-party setting: 1) what is the communication complexity, and 2) can it be accomplished securely, without Alice and Bob learning extra information about each other's input. Besides closeness testing, we also study the independence testing problem, where Alice and Bob have $t$ samples from distributions $a$ and $b$ respectively, which may be correlated; the question is whether $a, b$ are independent or $\varepsilon$-far from being independent. Our contribution is three-fold: