We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1, …, n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(ε⁻² + log n) bits of space with 2/3 success probability, where 0 < ε < 1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously.

We also give an algorithm to estimate the Hamming norm of a stream, a generalization of the number of distinct elements, which is useful in data cleaning, packet tracing, and database auditing. Our algorithm uses nearly optimal space, and has optimal O(1) update and reporting times.
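For intuition, here is a minimal Python sketch of a much simpler distinct-elements estimator in the same spirit: the k-minimum-values (KMV) sketch, which hashes each element into [0, 1) and keeps the k smallest distinct hash values. It uses O(k log n) bits rather than the optimal O(ε⁻² + log n) of the abstract's algorithm, and achieves relative error roughly 1/√k; the function name and parameters are illustrative, not from the paper.

```python
import hashlib
import heapq

def kmv_distinct_estimate(stream, k=256):
    """Estimate the number of distinct elements with a k-minimum-values
    sketch: keep the k smallest distinct hash values seen; if the k-th
    smallest is v, the distinct count is roughly (k - 1) / v."""
    heap = []      # max-heap (values negated) holding the k smallest hashes
    kept = set()   # the hash values currently in the heap
    for x in stream:
        digest = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
        u = int.from_bytes(digest, "big") / 2**64   # hash mapped into [0, 1)
        if u in kept:
            continue                                # duplicate hash value
        if len(heap) < k:
            heapq.heappush(heap, -u)
            kept.add(u)
        elif u < -heap[0]:                          # beats current k-th smallest
            evicted = -heapq.heappushpop(heap, -u)
            kept.discard(evicted)
            kept.add(u)
    if len(heap) < k:
        return len(heap)        # fewer than k distinct elements: exact count
    return (k - 1) / (-heap[0])
```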
We study high-dimensional distribution learning in an agnostic setting where an adversary is allowed to arbitrarily corrupt an ε-fraction of the samples. Such questions have a rich history spanning statistics, machine learning, and theoretical computer science. Even in the most basic settings, the only known approaches are either computationally inefficient or lose dimension-dependent factors in their error guarantees. This raises the following question: Is high-dimensional agnostic distribution learning even possible, algorithmically?

In this work, we obtain the first computationally efficient algorithms with dimension-independent error guarantees for agnostically learning several fundamental classes of high-dimensional distributions: (1) a single Gaussian, (2) a product distribution on the hypercube, (3) mixtures of two product distributions (under a natural balancedness condition), and (4) mixtures of spherical Gaussians. Our algorithms achieve error that is independent of the dimension, and in many cases scales nearly-linearly with the fraction of adversarially corrupted samples. Moreover, we develop a general recipe for detecting and correcting corruptions in high dimensions that may be applicable to many other problems.
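As a rough illustration of what such a recipe can look like, here is a hedged sketch of the spectral "filtering" idea for robust mean estimation (illustrative constants and stopping rule, not the paper's exact algorithm): while the empirical covariance has a suspiciously large eigenvalue, corrupted points must be responsible, so score points by their deviation along the top eigenvector and discard the extreme tail.

```python
import numpy as np

def filtered_mean(X, eps, var_bound=1.0):
    """Sketch of spectral filtering for robust mean estimation under
    eps-corruption, assuming inliers have per-direction variance <= var_bound.
    X is an (n, d) array of samples with d >= 2. Illustrative only."""
    X = np.asarray(X, dtype=float).copy()
    while True:
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
        top_val, top_vec = eigvals[-1], eigvecs[:, -1]
        if top_val <= var_bound * (1 + 10 * eps):   # spectrum looks clean: done
            return mu
        scores = np.abs((X - mu) @ top_vec)         # deviation in worst direction
        keep = scores <= np.quantile(scores, 1 - eps)
        if keep.all():                              # nothing left to remove
            return mu
        X = X[keep]
```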
We study problems in distribution property testing: Given sample access to one or more unknown discrete distributions, we want to determine whether they have some global property or are ε-far from having the property in ℓ₁ distance (equivalently, total variation distance, or "statistical distance"). In this work, we give a novel general approach for distribution testing. We describe two techniques: our first technique gives sample-optimal testers, while our second technique gives matching sample lower bounds. As a consequence, we resolve the sample complexity of a wide variety of testing problems.

Our upper bounds are obtained via a modular reduction-based approach. Our approach yields optimal testers for numerous problems by using a standard ℓ₂-identity tester as a blackbox. Using this recipe, we obtain simple estimators for a wide range of problems, encompassing most problems previously studied in the TCS literature, namely: (1) identity testing to a fixed distribution, (2) closeness testing between two unknown distributions (with equal/unequal sample sizes), (3) independence testing (in any number of dimensions), (4) closeness testing for collections of distributions, and (5) testing histograms. For all of these problems, our testers are sample-optimal, up to constant factors. With the exception of (1), ours are the first sample-optimal testers for the corresponding problems. Moreover, our estimators are significantly simpler to state and analyze compared to previous results.

As an important application of our reduction-based technique, we obtain the first nearly instance-optimal algorithm for testing equivalence between two unknown distributions. The sample complexity of our algorithm depends on the structure of the unknown distributions, as opposed to merely their domain size, and is much better compared to the worst-case optimal ℓ₁-tester in most natural instances. Moreover, our technique naturally generalizes to other metrics beyond the ℓ₁-distance. As an illustration of its flexibility, we use it to obtain the first near-optimal equivalence tester under the Hellinger distance.

Our lower bounds are obtained via a direct information-theoretic approach: Given a candidate hard instance, our proof proceeds by bounding the mutual information between appropriate random variables. While this is a classical method in information theory, prior to our work, it had not been used in distribution property testing. Previous lower bounds relied either on the birthday paradox, or on moment-matching, and were thus restricted to symmetric properties. Our lower bound approach does not suffer from any such restrictions and gives tight sample lower bounds for the aforementioned problems.
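For concreteness, the kind of ℓ₂ blackbox such a reduction invokes can be as simple as the following unbiased statistic (a hedged sketch assuming a Poissonized sample size; the thresholds and the reductions themselves are omitted). Under Poisson sampling, Z has expectation m²·‖p − q‖₂², so a tester accepts when Z falls below a threshold calibrated to ε.

```python
import numpy as np

def l2_identity_statistic(samples, q):
    """Unbiased estimator of m^2 * ||p - q||_2^2 from Poissonized samples
    of an unknown distribution p on {0, ..., len(q) - 1}.
    If N_i ~ Poisson(m * p_i), then E[(N_i - m*q_i)^2 - N_i] = m^2 (p_i - q_i)^2.
    (Illustrative sketch of a standard l2 tester, not the paper's reduction.)"""
    q = np.asarray(q, dtype=float)
    m = len(samples)                                   # ~ Poisson(m) sample size
    counts = np.bincount(samples, minlength=len(q)).astype(float)
    return float(np.sum((counts - m * q) ** 2 - counts))
```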
We describe a general technique that yields the first Statistical Query lower bounds for a range of fundamental high-dimensional learning problems involving Gaussian distributions. Our main results are for the problems of (1) learning Gaussian mixture models (GMMs), and (2) robust (agnostic) learning of a single unknown Gaussian distribution. For each of these problems, we show a super-polynomial gap between the (information-theoretic) sample complexity and the computational complexity of any Statistical Query algorithm for the problem. Statistical Query (SQ) algorithms are a class of algorithms that are only allowed to query expectations of functions of the distribution rather than directly access samples. This class of algorithms is quite broad: a wide range of known algorithmic techniques in machine learning are known to be implementable using SQs. Moreover, for the unsupervised learning problems studied in this paper, all known algorithms with non-trivial performance guarantees are SQ or are easily implementable using SQs.

Our SQ lower bound for Problem (1) is qualitatively matched by known learning algorithms for GMMs. At a conceptual level, this result implies that, as far as SQ algorithms are concerned, the computational complexity of learning GMMs is inherently exponential in the dimension of the latent space, even though there is no such information-theoretic barrier. Our lower bound for Problem (2) implies that the accuracy of the robust learning algorithm in [DKK+16] is essentially best possible among all polynomial-time SQ algorithms. On the positive side, we also give a new (SQ) learning algorithm for Problem (2) achieving the information-theoretically optimal accuracy, up to a constant factor, whose running time essentially matches our lower bound. Our algorithm relies on a filtering technique generalizing [DKK+16] that removes outliers based on higher-order tensors.

Our SQ lower bounds are attained via a unified moment-matching technique that is useful in other contexts and may be of broader interest. Our technique yields nearly-tight lower bounds for a number of related unsupervised estimation problems. Specifically, for the problems of (3) robust covariance estimation in spectral norm, and (4) robust sparse mean estimation, we establish a quadratic statistical-computational tradeoff for SQ algorithms, matching known upper bounds. Finally, our technique can be used to obtain tight sample complexity lower bounds for high-dimensional testing problems. Specifically, for the classical problem of robustly testing an unknown mean (known covariance) Gaussian, our technique implies an information-theoretic sample lower bound that scales linearly in the dimension. Our sample lower bound matches the sample complexity of the corresponding robust learning problem and separates the sample complexity of robust testing from standard (non-robust) testing. This separation is surprising because such a gap does not exist for the corresponding learning problem.

In summary, the sample complexity of both versions of th...
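To make the SQ model concrete, here is a minimal sketch of an SQ oracle interface (a hypothetical helper for illustration, not from the paper): the algorithm supplies a bounded function f and a tolerance τ, and receives E[f(X)] up to additive error τ, never touching samples directly.

```python
import numpy as np

def make_sq_oracle(samples, rng=None):
    """Simulate a Statistical Query oracle STAT(tau) from a fixed sample set:
    query(f, tau) returns E[f(X)] up to additive slack tau. An SQ algorithm
    interacts with the distribution only through such queries.
    (Hypothetical illustration of the SQ model, not the paper's construction.)"""
    samples = np.asarray(samples)
    rng = np.random.default_rng() if rng is None else rng

    def query(f, tau):
        # Empirical expectation of f; the tau slack the model allows is
        # simulated here as uniform noise (a real oracle may be adversarial).
        est = float(np.mean([f(x) for x in samples]))
        return est + rng.uniform(-tau, tau)

    return query
```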
We present a novel connection between binary search trees (BSTs) and points in the plane satisfying a simple property. Using this correspondence, we achieve the following results: 3. The first proposal for dynamic optimality not based on splay trees. A natural greedy but offline algorithm was presented by Lucas [1988], and independently by Munro [2000], and was conjectured to be an (additive) approximation of the best binary search tree. We show that there exists an equal-cost online algorithm, transforming the conjecture of Lucas and Munro into the conjecture that the greedy algorithm is dynamically optimal.
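The "simple property" in question is arboral satisfaction: every two points not on a common horizontal or vertical line must span a closed rectangle containing a third point of the set; the paper's correspondence identifies valid BST executions with arborally satisfied supersets of the access points. A brute-force checker, assuming that definition (illustrative O(n³) code), makes the property concrete:

```python
from itertools import combinations

def is_arborally_satisfied(points):
    """Brute-force check of arboral satisfaction: every pair of points not
    sharing an x- or y-coordinate must span a closed rectangle containing
    at least one other point of the set. O(n^3); for illustration only."""
    pts = set(points)
    for (x1, y1), (x2, y2) in combinations(pts, 2):
        if x1 == x2 or y1 == y2:
            continue                    # degenerate rectangle: nothing to check
        lox, hix = sorted((x1, x2))
        loy, hiy = sorted((y1, y2))
        if not any(lox <= x <= hix and loy <= y <= hiy
                   and (x, y) not in ((x1, y1), (x2, y2))
                   for (x, y) in pts):
            return False
    return True
```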