Random sampling is one of the most fundamental data management tools available. However, most current research involving sampling considers the problem of how to use a sample, and not how to compute one. The implicit assumption is that a "sample" is a small data structure that is easily maintained as new data are encountered, even though simple statistical arguments demonstrate that very large samples of gigabytes or terabytes in size can be necessary to provide high accuracy. No existing work tackles the problem of maintaining very large, disk-based samples from a data management perspective, and no techniques now exist for maintaining very large samples in an online manner from streaming data. In this paper, we present online algorithms for maintaining on-disk samples that are gigabytes or terabytes in size. The algorithms are designed for streaming data, or for any environment where a large sample must be maintained online in a single pass through a data set. The algorithms meet the strict requirement that the sample always be a true, statistically random sample (without replacement) of all of the data processed thus far. Our algorithms are also suitable for biased or unequal probability sampling.
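As a reference point for the requirement that the sample always be a true random sample (without replacement) of all data seen so far, the sketch below shows classic memory-resident reservoir sampling in Python. It illustrates only the single-pass sampling guarantee, not the disk-based, gigabyte- or terabyte-scale algorithms developed in this paper; the function and parameter names are our own.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of k records, without replacement,
    over a single pass through `stream` (classic reservoir sampling)."""
    reservoir = []
    for i, record in enumerate(stream):
        if i < k:
            # The first k records fill the reservoir directly.
            reservoir.append(record)
        else:
            # Record i+1 replaces a random slot with probability k/(i+1),
            # which keeps every record seen so far equally likely to be
            # in the sample.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir

# Example: a size-5 sample drawn in one pass over a stream of 1,000 records.
print(reservoir_sample(range(1000), 5))
```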
Introduction

Despite the variety of alternatives for approximate query processing [8][9][10][14][29], sampling remains one of the most powerful methods for building a one-pass synopsis of a data set in a streaming environment, where the assumption is that there is too much data to store all of it permanently. Sampling's many benefits include:

• Sampling is the most widely-studied and best-understood approximation technique currently available. Sampling has been studied for hundreds of years, and many fundamental results describe the utility of random samples, such as the central limit theorem and the Chernoff, Hoeffding, and Chebyshev bounds [7][25].

• Sampling is the most versatile approximation technique available. Most data processing algorithms can be run on a random sample of a data set rather than on the original data with little or no modification. For example, almost any data mining algorithm for building a decision tree classifier can be run directly on a sample.

• Sampling is the most widely-used approximation technique.

However, most existing work on sampling is relevant mostly for sampling from data stored in a database, and is not suitable for emerging applications such as stream-based data management. Furthermore, the implicit assumption in most existing work is that a "sample" is a small, in-memory data structure. This is not always true. For many applications, very large samples containing billions of records can be required to provide acceptable accuracy. Fortunately, modern storage hardware gives us the capacity to cheaply store very large samples that should suffice for even difficult and emerging applications, such as futuristic "smart dust" environments where billions of tiny sensors produce billions of observations per second that must be joined, cross-correlated, a...
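To make concrete why very large samples can be required for acceptable accuracy, the following back-of-the-envelope calculation applies the Hoeffding bound cited above. The accuracy and confidence targets are illustrative choices of ours, not figures from the paper.

```python
import math

def hoeffding_sample_size(epsilon, delta):
    """Smallest n such that the mean of n i.i.d. samples of a quantity
    bounded in [0, 1] is within +/- epsilon of the true mean with
    probability at least 1 - delta, by the Hoeffding bound:
        n >= ln(2 / delta) / (2 * epsilon**2)
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

# A loose target (1% error, 99% confidence) needs only ~26,500 records ...
print(hoeffding_sample_size(0.01, 0.01))
# ... but tightening the error to 0.01% pushes the sample past 260 million
# records, far too large to treat as a small in-memory data structure.
print(hoeffding_sample_size(0.0001, 0.01))
```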