Let P be a set (called points), Q be a set (called queries), and f : P × Q → [0, ∞) be a function (called cost). For an error parameter ε > 0, a set S ⊆ P with a weight function w : P → [0, ∞) is an ε-coreset if ∑_{s∈S} w(s) f(s, q) approximates ∑_{p∈P} f(p, q) up to a multiplicative factor of 1 ± ε for every given query q ∈ Q. Coresets are used to solve fundamental problems in machine learning of streaming and distributed data.
We construct coresets for the k-means clustering of n input points, both in an arbitrary metric space and in d-dimensional Euclidean space. For Euclidean space, we present the first coreset whose size is simultaneously independent of both d and n. In particular, this is the first coreset of size o(n) for a stream of n sparse points in a d ≥ n dimensional space (e.g. adjacency matrices of graphs). We also provide the first generalizations of such coresets for handling outliers. For arbitrary metric spaces, we improve the dependence on k to k log k and present a matching lower bound.
For M-estimator clustering (special cases include the well-known k-median and k-means clustering), we introduce a new technique for converting an offline coreset construction to the streaming setting. Our method yields streaming coreset algorithms requiring the storage of O(S + k log n) points, where S is the size of the offline coreset. In comparison, the previous state of the art was the merge-and-reduce technique, which required O(S log^{2a+1} n) points, where a is the exponent in the offline construction's dependence on ε^{−1}. For example, combining our offline and streaming results, we produce a streaming metric k-means coreset algorithm using O(ε^{−2} k log k log n) points of storage. The previous state of the art required O(ε^{−4} k log k log^6 n) points.
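The ε-coreset definition above can be illustrated with a small Python sketch. This is not the construction from the abstract: it only evaluates the k-means cost f(p, q) = min over centers c in q of ‖p − c‖² and measures how far a given weighted set is from the full set on a finite list of queries. The helper names (kmeans_cost, coreset_error) are hypothetical, and checking finitely many queries is a necessary condition only, since the definition quantifies over every q ∈ Q.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans_cost(points: Sequence[Point], weights: Sequence[float],
                centers: Sequence[Point]) -> float:
    """Weighted k-means cost: sum of w(p) * min_c ||p - c||^2."""
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def coreset_error(P: Sequence[Point], S: Sequence[Point], w: Sequence[float],
                  queries: Sequence[Sequence[Point]]) -> float:
    """Largest relative error of (S, w) against P over the given queries.

    (S, w) can only be an eps-coreset if this value is at most eps.
    """
    worst = 0.0
    for centers in queries:
        full = kmeans_cost(P, [1.0] * len(P), centers)
        approx = kmeans_cost(S, w, centers)
        if full > 0:
            worst = max(worst, abs(approx - full) / full)
        elif approx > 0:
            worst = math.inf  # full cost is 0 but the weighted cost is not
    return worst


# Toy data (an assumption for illustration): every point appears twice,
# so one representative per location with weight 2 is an exact coreset.
P: List[Point] = [(0.0, 0.0), (0.0, 0.0), (2.0, 0.0), (2.0, 0.0)]
queries = [[(1.0, 0.0)], [(0.0, 0.0), (3.0, 0.0)]]
exact_error = coreset_error(P, [(0.0, 0.0), (2.0, 0.0)], [2.0, 2.0], queries)
bad_error = coreset_error(P, [(0.0, 0.0)], [4.0], queries)
```

On this toy input the two-point weighted set reproduces the full cost on both queries, while the single-point set fails on the second query even though it matches on the first, which is why the definition demands the guarantee for every query.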
Data streams have emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the "AMS" paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give a precise characterization of the functions G on frequency vectors m_i (1 ≤ i ≤ n) for which ∑_{i∈[n]} G(m_i) can be approximated efficiently, where "efficiently" means with a single pass over the data stream and poly-logarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in the streaming literature. In this paper we finally resolve the AMS main question and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotonic functions G : R → R such that G(0) = 0 and G can be computed in poly-logarithmic time and space, and ask: for which G in this class is there a (1 ± ε)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) for any polylogarithmic ε? We give an algebraic characterization for all such G so that:
• For all functions G in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient (1 ± ε)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) with polylogarithmic memory and a single pass over the data stream; while
• For all functions G in our class that do not satisfy our algebraic characterization, we show a lower bound: any one-pass streaming algorithm that computes an approximation to ∑_{i∈[n]} G(m_i) requires more than polylogarithmic memory.
Thus, we provide a zero-one law for all monotonically increasing functions G which are zero at the origin. Our results are quite general.
As just one illustrative example, our main theorem implies a lower bound for G(x) = (x(x − 1))^{0.5 arctan(x+1)}, while for the function G(x) = (x(x + 1))^{0.5 arctan(x+1)} our main theorem automatically yields a polylog-memory, one-pass (1 ± ε)-approximation algorithm for computing ∑_{i∈[n]} G(m_i). For both of these examples no lower or upper bounds were known. Of course, these are just illustrative examples, and there are many others. One might argue that these two functions may not be of interest in practical applications; we stress that our law works for all functions in this class, and the above examples illustrate the power of our method.
To the best of our knowledge, this is the first zero-one law in the streaming model for a wide class of functions, though we suspect that there are many more such laws to be discovered. Surprisingly, our upper bound requires only 4-wise independence and does not need the stronger machinery of Nisan's pseudorandom generators, even though our class captures multiple functions that previously required Nisan's generators. Furthermore, we believe that our methods can be extended to more general models and complexity classes. For instance, the law also holds for a smaller class of n...
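For intuition about one-pass approximation of ∑_{i∈[n]} G(m_i) with 4-wise independence, the following Python sketch implements the classic AMS "tug-of-war" estimator for the single special case G(x) = x² (the second frequency moment). It is not the general algorithm of the abstract above. The class name F2Sketch, its parameters, and the choice of a degree-3 polynomial hash over the prime 2^61 − 1 are assumptions made for illustration; items are assumed to be non-negative integers below that prime.

```python
import random
from collections import Counter
from statistics import median

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash field size


def exact_g_sum(stream, G):
    """Exact sum of G(m_i) over the item frequencies m_i (stores all counts)."""
    return sum(G(m) for m in Counter(stream).values())


class F2Sketch:
    """Median-of-means tug-of-war sketch for G(x) = x^2 (illustrative)."""

    def __init__(self, groups=5, per_group=16, seed=0):
        rng = random.Random(seed)
        # A random degree-3 polynomial over a prime field gives a
        # 4-wise independent hash; one polynomial per counter.
        self.coeffs = [[[rng.randrange(PRIME) for _ in range(4)]
                        for _ in range(per_group)] for _ in range(groups)]
        self.counters = [[0] * per_group for _ in range(groups)]

    @staticmethod
    def _sign(coeffs, item):
        h = 0
        for c in coeffs:  # Horner evaluation of the polynomial at `item`
            h = (h * item + c) % PRIME
        # Low bit mapped to +1/-1; the bias from PRIME being odd is negligible.
        return 1 if h & 1 else -1

    def update(self, item, count=1):
        """Process `count` occurrences of `item` from the stream."""
        for g, group in enumerate(self.coeffs):
            for j, c in enumerate(group):
                self.counters[g][j] += count * self._sign(c, item)

    def estimate(self):
        """Median over groups of the mean squared counter value."""
        means = [sum(z * z for z in row) / len(row) for row in self.counters]
        return median(means)


# A stream holding one distinct item seven times: every counter is +7 or -7,
# so the estimate equals 7^2 regardless of the hash values.
sketch = F2Sketch()
for _ in range(7):
    sketch.update(42)
single_item_estimate = sketch.estimate()
```

The sketch keeps groups × per_group counters regardless of the stream length, whereas exact_g_sum stores one count per distinct item; for streams with many distinct items the estimate is only approximate, with accuracy governed by the number of counters.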
Abstract. The problem of (approximately) counting the number of triangles in a graph is one of the basic problems in graph theory. In this paper we study the problem in the streaming model, and in particular the amount of memory required by a randomized algorithm to solve it. If the algorithm is allowed one pass over the stream, we present a best possible lower bound of Ω(m) for graphs G with m edges on n vertices. If a constant number of passes is allowed, we show a lower bound of Ω(m/T), where T is the number of triangles. We match, in some sense, this lower bound with a 2-pass O(m/T^{1/3})-memory algorithm that solves the problem of distinguishing graphs with no triangles from graphs with at least T triangles. We present a new graph parameter ρ(G), the triangle density, and conjecture that the space complexity of the triangles problem is Ω(m/ρ(G)). We match this by a second algorithm that solves the distinguishing problem using O(m/ρ(G)) memory.
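As a point of reference for the memory bounds above, here is a minimal Python sketch of two baselines. Neither is the 2-pass algorithm from the abstract. count_triangles stores every edge (Θ(m) memory, in line with the one-pass lower bound), and sampled_triangle_estimate is a simple edge-sampling estimator, swapped in for illustration: it keeps each edge independently with probability p and rescales the sampled triangle count by 1/p³. Function names and parameters are hypothetical, and the stream is assumed to list each undirected edge once.

```python
import random


def count_triangles(edges):
    """Exact triangle count; stores the whole graph in memory."""
    adj = {}
    unique_edges = set()
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        unique_edges.add((min(u, v), max(u, v)))
    # Each triangle is found once per each of its three edges.
    total = sum(len(adj[u] & adj[v]) for u, v in unique_edges)
    return total // 3


def sampled_triangle_estimate(edge_stream, p, seed=0):
    """Keep each edge with probability p, then rescale the count by 1/p^3.

    A triangle survives only if all three of its edges are kept, which
    happens with probability p^3, so the rescaled count is unbiased.
    """
    rng = random.Random(seed)
    kept = [e for e in edge_stream if rng.random() < p]
    return count_triangles(kept) / p ** 3


# Complete graph on 4 vertices: it contains exactly 4 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
exact = count_triangles(k4)
full_sample = sampled_triangle_estimate(k4, p=1.0)
```

With p = 1.0 every edge is kept and the estimator reduces to the exact count; smaller p lowers the expected memory to about p·m edges at the price of higher variance, which grows when the graph has few triangles.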