Bregman divergences are important distance measures that are used extensively in data-driven applications such as computer vision, text mining, and speech processing, and are a key focus of interest in machine learning. Answering nearest neighbor (NN) queries under these measures is very important in these applications and has been the subject of extensive study, but is problematic because these distance measures lack metric properties like symmetry and the triangle inequality. In this paper, we present the first provably approximate nearest-neighbor (ANN) algorithms for a broad sub-class of Bregman divergences under some assumptions. Specifically, we examine Bregman divergences which can be decomposed along each dimension and our bounds also depend on restricting the size of our allowed domain. We obtain bounds for both the regular asymmetric Bregman divergences as well as their symmetrized versions. To do so, we develop two geometric properties vital to our analysis: a reverse triangle inequality (RTI) and a relaxed triangle inequality called µ-defectiveness where µ is a domain-dependent value. Bregman divergences satisfy the RTI but not µ-defectiveness. However, we show that the square root of a Bregman divergence does satisfy µ-defectiveness. This allows us to then utilize both properties in an efficient search data structure that follows the general two-stage paradigm of a ring-tree decomposition followed by a quad tree search used in previous near-neighbor algorithms for Euclidean space and spaces of bounded doubling dimension.Our first algorithm resolves a query for a d-dimensionaltime and O n log d−1 n space and holds for generic µ-defective distance measures satisfying a RTI. Our second algorithm is more specific in analysis to the Bregman divergences and uses a further structural parameter, the maximum ratio of second derivatives over each dimension of our allowed domain (c 0 ). This allows us to locate a (1 + ε)-ANN in O(log n) time and O(n) space, where there is a further (c 0 ) d factor in the big-Oh for the query time.
Bregman divergences are important distance measures that are used in applications such as computer vision, text mining, and speech processing, and are a focus of interest in machine learning due to their information-theoretic properties. There has been extensive study of algorithms for clustering and near neighbor search with respect to these divergences. In all cases, the guarantees depend not just on the data size n and dimensionality d, but also on a structure constant µ ≥ 1 that depends solely on a generating convex function φ and can grow without bound independently. In general, this µ parametrizes the degree to which a given divergence is "asymmetric".In this paper, we provide the first evidence that this dependence on µ might be intrinsic. We focus on the problem of approximate nearest neighbor (ANN) search for Bregman divergences. We show that under the cell probe model, any non-adaptive data structure (like locality-sensitive hashing) for c-approximate near-neighbor search that admits r probes must use space Ω(dn 1+µ/cr ). In contrast for LSH under 1 the best bound is Ω(dn 1+1/cr ).Our result interpolates between several known lower bounds both for LSH-based ANN under 1 as well as the generally harder partial match (Partial Match) problem (in non-adaptive settings). The bounds match the former when µ is small and the latter when µ is Ω(d/ log n). This further strengthens the intuition that Partial Match corresponds to an "asymmetric" version of ANN, as well as opening up the possibility of a new line of attack for lower bounds on Partial Match.Our new tool is a directed variant of the standard boolean noise operator. We prove a generalization of the Bonami-Beckner hypercontractivity inequality (restricted to certain subsets of the Hamming cube), and use this to prove the desired directed isoperimetric inequality that we use in our data structure lower bound. *
We study spectral algorithms for the high-dimensional Nearest Neighbor Search problem (NNS). In particular, we consider a semi-random setting where a dataset P in R d is chosen arbitrarily from an unknown subspace of low dimension k d, and then perturbed by fully d-dimensional Gaussian noise. We design spectral NNS algorithms whose query time depends polynomially on d and log n (where n = |P |) for large ranges of k, d and n. Our algorithms use a repeated computation of the top PCA vector/subspace, and are effective even when the random-noise magnitude is much larger than the interpoint distances in P . Our motivation is that in practice, a number of spectral NNS algorithms outperform the random-projection methods that seem otherwise theoretically optimal on worst case datasets. In this paper we aim to provide theoretical justification for this disparity.
We study coresets for various types of range counting queries on uncertain data. In our model each uncertain point has a probability density describing its location, sometimes defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by just examining this subset. We study three distinct types of queries. RE queries return the expected number of points in a query range. RC queries return the number of points in the range with probability at least a threshold. RQ queries returns the probability that fewer than some threshold fraction of the points are in the range. In both RC and RQ coresets the threshold is provided as part of the query. And for each type of query we provide coreset constructions with approximation-size tradeoffs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based approaches on axis-aligned range queries.
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.