A conditional sampling oracle for a probability distribution D returns samples from the conditional distribution of D restricted to a specified subset of the domain. A recent line of work [7,6] has shown that access to such a conditional sampling oracle requires only a polylogarithmic or even constant number of samples to solve distribution testing problems such as identity and uniformity testing. This significantly improves over the standard sampling model, where polynomially many samples are necessary.

Inspired by these results, we introduce a computational model based on conditional sampling to develop sublinear algorithms with exponentially faster runtimes compared to standard sublinear algorithms. We focus on geometric optimization problems over points in high-dimensional Euclidean space. Access to these points is provided via a conditional sampling oracle that takes as input a succinct representation of a subset of the domain and outputs a uniformly random point in that subset. We study two well-studied problems: k-means clustering and estimating the weight of the minimum spanning tree. In contrast to prior algorithms for the classic model, our algorithms have time, space, and sample complexity that is polynomial in the dimension and polylogarithmic in the number of points.

Finally, we comment on the applicability of the model and compare it with existing ones such as the streaming, parallel, and distributed computational models.

Introduction

Consider a scenario where you are given a dataset of input points X, from some domain Ω, stored in random access memory, and you want to estimate the number of distinct elements of this (multi-)set. One obvious way to do so is to iterate over all elements and use a hash table to find duplicates. Although simple, this solution becomes unattractive if the input is huge and it is too expensive even to parse it. In such cases, one natural goal is to get a good estimate of this number instead of computing it exactly. One way to do that is to pick some random points from X and estimate, based on those, the total number of distinct elements in the set. This is equivalent to getting samples from a probability distribution where the probability of each element is proportional to the number of times it appears in X. In the context of probability distributions, this is a well-understood problem, called support estimation, and tight bounds are known for its sample complexity. More specifically, it is shown in [22] that the number of samples needed is Θ(n/log n), which, although sublinear, still has a huge dependence on the input size n = |X|.

In several situations, more flexible access to the dataset might be possible, e.g. when the data are stored in a database, which can significantly reduce the number of queries needed to perform support estimation or other tasks. One recent model, called conditional sampling, introduced by [7,6] for distribution testing, describes such a possibility. In that model, there is an underlying distribution D, and a conditional sampling oracle takes as input a ...
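To make the oracle concrete, here is a minimal sketch in Python of a conditional sampling oracle over a stored multiset. The class name ConditionalSampler, the predicate-based subset representation, and the example query are illustrative assumptions, not the paper's actual interface; they show only the access pattern the model grants.

```python
import random

class ConditionalSampler:
    """Toy conditional sampling oracle over a stored multiset X.

    A query describes a subset S of the domain (here, as a predicate);
    the oracle returns a sample from the underlying distribution
    conditioned on S. Names and interface are illustrative assumptions.
    """

    def __init__(self, points):
        # The underlying distribution draws a uniformly random element of
        # the multiset X, so each domain point has probability proportional
        # to the number of times it appears in X.
        self.points = list(points)

    def sample(self, predicate):
        # Restrict X to the queried subset S and sample uniformly from the
        # restriction; return None if S contains no point of X.
        restricted = [x for x in self.points if predicate(x)]
        return random.choice(restricted) if restricted else None


# Example: conditioning on a singleton {2} reveals whether 2 occurs in X
# at all -- one basic ingredient in support-size estimation.
oracle = ConditionalSampler([1, 1, 1, 2, 5, 5, 9])
print(oracle.sample(lambda x: x == 2))  # prints 2, since 2 appears in X
```

In the paper's geometric setting, the subset would instead be given by a succinct representation of a region of Euclidean space and the oracle would return a uniformly random point of X inside it; the predicate form above is just the simplest stand-in for such a representation.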