Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. Our algorithms are also suitable for biased or unequal probability sampling.
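As a reference point for the requirement that the sample always be a true random sample (without replacement) of all data seen so far, the sketch below shows classic memory-resident reservoir sampling in Python. It illustrates only the single-pass sampling guarantee, not the disk-based, gigabyte- or terabyte-scale algorithms developed in this paper; the function and parameter names are our own.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k records, without replacement,
    over a single pass through `stream` (classic reservoir sampling)."""
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            # The first k records fill the reservoir directly.
            reservoir.append(record)
        else:
            # Record i+1 replaces a random slot with probability k/(i+1),
            # which keeps every record seen so far equally likely to be
            # in the sample.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir

# Example: a size-5 sample drawn in one pass over a stream of 1,000 records.
print(reservoir_sample(range(1000), 5))
```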
Introduction

Despite the variety of alternatives for approximate query processing [8][9][10][14][29], sampling remains one of the most powerful methods for building a one-pass synopsis of a data set in a streaming environment, where the assumption is that there is too much data to store all of it permanently. Sampling's many benefits include:

• Sampling is the most widely-studied and best-understood approximation technique currently available. Sampling has been studied for hundreds of years, and many fundamental results describe the utility of random samples, such as the central limit theorem and the Chernoff, Hoeffding, and Chebyshev bounds [7][25].

• Sampling is the most versatile approximation technique available. Most data processing algorithms can be run on a random sample of a data set rather than on the original data with little or no modification. For example, almost any data mining algorithm for building a decision tree classifier can be run directly on a sample.

• Sampling is the most widely-used approximation technique.

However, most existing work on sampling is relevant mostly for sampling from data stored in a database, and is not suitable for emerging applications such as stream-based data management. Furthermore, the implicit assumption in most existing work is that a "sample" is a small, in-memory data structure. This is not always true. For many applications, very large samples containing billions of records can be required to provide acceptable accuracy. Fortunately, modern storage hardware gives us the capacity to cheaply store very large samples that should suffice for even difficult and emerging applications, such as futuristic "smart dust" environments where billions of tiny sensors produce billions of observations per second that must be joined, cross-correlated, a...
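To make concrete why very large samples can be required for acceptable accuracy, the following back-of-the-envelope calculation applies the Hoeffding bound cited above. The accuracy and confidence targets are illustrative choices of ours, not figures from the paper.

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest n such that the mean of n i.i.d. samples of a quantity
    bounded in [0, 1] is within +/- epsilon of the true mean with
    probability at least 1 - delta, by the Hoeffding bound:
        n >= ln(2 / delta) / (2 * epsilon**2)
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# A loose target (1% error, 99% confidence) needs only ~26,500 records ...
print(hoeffding_sample_size(0.01, 0.01))
# ... but tightening the error to 0.01% pushes the sample past 260 million
# records, far too large to treat as a small in-memory data structure.
print(hoeffding_sample_size(0.0001, 0.01))
```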