A conditional sampling oracle for a probability distribution D returns samples from the conditional distribution of D restricted to a specified subset of the domain. A recent line of work [7,6] has shown that access to such a conditional sampling oracle requires only a polylogarithmic, or even constant, number of samples to solve distribution testing problems such as identity and uniformity testing. This significantly improves over the standard sampling model, where polynomially many samples are necessary.

Inspired by these results, we introduce a computational model based on conditional sampling to develop sublinear algorithms with exponentially faster runtimes compared to standard sublinear algorithms. We focus on geometric optimization problems over points in high-dimensional Euclidean space. Access to these points is provided via a conditional sampling oracle that takes as input a succinct representation of a subset of the domain and outputs a uniformly random point in that subset. We study two well-studied problems: k-means clustering and estimating the weight of the minimum spanning tree. In contrast to prior algorithms for the classic model, our algorithms have time, space, and sample complexity that is polynomial in the dimension and polylogarithmic in the number of points.

Finally, we comment on the applicability of the model and compare it with existing models such as streaming, parallel, and distributed computation.
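To make the oracle concrete, the following is a minimal sketch of its input/output behavior over a stored point set. Here the "succinct representation of a subset" is modeled, purely for illustration, as an arbitrary membership predicate; the representations the paper actually supports, and their succinctness guarantees, may differ. The linear scan below only simulates the oracle's behavior; in the model, each query is charged as a single operation.

```python
import random

class CondSamplingOracle:
    """Illustrative conditional sampling oracle over a finite point set X.

    sample(in_subset) returns a uniformly random point of X lying in the
    queried subset, or None if no input point lies in the subset.
    """

    def __init__(self, points):
        # points: list of points in R^d, represented here as tuples of floats
        self.points = points

    def sample(self, in_subset):
        # Restrict X to the queried subset, then pick uniformly at random.
        candidates = [p for p in self.points if in_subset(p)]
        if not candidates:
            return None
        return random.choice(candidates)

# Example query: a uniformly random input point inside an axis-parallel box.
oracle = CondSamplingOracle([(0.1, 0.2), (0.5, 0.9), (0.4, 0.4)])
point = oracle.sample(lambda p: all(0.0 <= x <= 0.5 for x in p))
```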
Introduction

Consider a scenario where you are given a dataset of input points X, from some domain Ω, stored in random access memory, and you want to estimate the number of distinct elements of this (multi-)set. One obvious way to do so is to iterate over all elements and use a hash table to find duplicates. Although simple, this solution becomes unattractive if the input is huge and it is too expensive to even parse it. In such cases, one natural goal is to obtain a good estimate of this number instead of computing it exactly. One way to do that is to pick some random points from X and estimate, based on those, the total number of distinct elements in the set. This is equivalent to drawing samples from a probability distribution in which the probability of each element is proportional to the number of times it appears in X. In the context of probability distributions, this is a well-understood problem, called support estimation, and tight bounds are known for its sample complexity. More specifically, it is shown in [22] that the number of samples needed is Θ(n/log n), which, although sublinear, still has a huge dependence on the input size n = |X|.

In several situations, more flexible access to the dataset may be possible, e.g. when the data are stored in a database, which can significantly reduce the number of queries needed to perform support estimation or other tasks. One recent model, called conditional sampling, introduced by [7,6] for distribution testing, captures such a possibility. In that model, there is an underlying distribution D, and a conditional sampling oracle takes as input a ...
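As a toy illustration of the plain sampling approach from the opening paragraph, one can estimate the number of distinct elements from collisions among random samples. The sketch below assumes, for simplicity, that every distinct element appears in X with roughly equal multiplicity, so that uniform samples from X are roughly uniform over the support and a pair of samples collides with probability about 1/k when the support has size k. The optimal Θ(n/log n)-sample estimator of [22] is substantially more involved and makes no such assumption.

```python
import random
from itertools import combinations

def estimate_support(X, num_samples):
    """Collision-based estimate of the number of distinct elements of X.

    Assumes roughly equal multiplicities, so that the expected number of
    colliding sample pairs is about (number of pairs) / (support size).
    """
    samples = [random.choice(X) for _ in range(num_samples)]
    pairs = num_samples * (num_samples - 1) / 2
    collisions = sum(1 for a, b in combinations(samples, 2) if a == b)
    if collisions == 0:
        return float("inf")  # too few samples to observe any collision
    return pairs / collisions

# Example: a multiset with 100 distinct elements, each appearing 5 times.
X = [i for i in range(100) for _ in range(5)]
print(estimate_support(X, 200))  # close to 100 on average
```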