A formal framework for database sampling

2013 IEEE 14th International Conference on Information Reuse &Amp; Integration (IRI)

Cerqueus

Kristiansen

et al. 2013

Abstract-In a wide range of application areas (e.g. data mining, approximate query evaluation, histogram construction), database sampling has proved to be a powerful technique. It is generally used when the computational cost of processing large amounts of information is extremely high, and a faster response with a lower level of accuracy for the results is preferred. Previous sampling techniques achieve this balance, however, an evaluation of the cost of the database sampling process should be considered. We argue that the performance of current relational database sampling techniques that maintain the data integrity of the sample database is low and a faster strategy needs to be devised. In this paper we propose a very fast sampling method that maintains the referential integrity of the sample database intact. The sampling method targets the production environment of a system under development, that generally consists of large amounts of data computationally costly to analyze. We evaluate our method in comparison with previous database sampling approaches and show that our method produces a sample database at least 300 times faster and with a maximum trade off of 0.5% in terms of sample size error.

Section: Related Workmentioning

confidence: 99%

VFDS: Very fast database sampling system

2013 IEEE 14th International Conference on Information Reuse &Amp; Integration (IRI)

Cerqueus

Kristiansen

et al. 2013

“…The database sampling approach presented in [3] is oriented towards relational databases focusing on the advantage of using prototype databases populated with operational data. Data items that follow a set of integrity constraints (e.g.…”

Section: General Approachesmentioning

confidence: 99%

CoDS: A Representative Sampling Method for Relational Databases

Lecture Notes in Computer Science

Cerqueus

Murphy

et al. 2013

Abstract. Database sampling has become a popular approach to handle large amounts of data in a wide range of application areas such as data mining or approximate query evaluation. Using database samples is a potential solution when using the entire database is not cost-effective, and a balance between the accuracy of the results and the computational cost of the process applied on the large data set is preferred. Existing sampling approaches are either limited to specific application areas, to single table databases, or to random sampling. In this paper, we propose CoDS: a novel sampling approach targeting relational databases that ensures that the sample database follows the same distribution for specific fields as the original database. In particular it aims to maintain the distribution between tables. We evaluate the performance of our algorithm by measuring the representativeness of the sample with respect to the original database. We compare our approach with two existing solutions, and we show that our method performs faster and produces better results in terms of representativeness.

“…Existing data population tools for the testing environment focus on populating the resulting database with synthetic data values or use some type of random distribution to select the data that must be included in the resulting database [5].…”

Section: Related Workmentioning

confidence: 99%

“…However, the existing tools do not consider the dependencies between the data in a relational database, but are limited to random sampling, while maintaining various constraints (e.g. referential integrity constraint, domain constraint) [5], and generally they are oriented towards a specific application area ( [10], [7], [12], [8]). The objective of our research is a novel approach for database sampling, which would ensure the sample database respects the same relationships between data as the original database by verifying that both follow the same histograms for specific fields.…”

Section: Introductionmentioning

confidence: 99%

Towards realistic sampling

Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication

Murphy

Kristiansen

2013

Managing large amounts of information is one of the most expensive, time-consuming and non-trivial activities and it usually requires expert knowledge. In a wide range of application areas, such as data mining, histogram construction, approximate query evaluation, and software validation, handling exponentially growing databases has become a difficult challenge, and a subset of the data is generally preferred. As a solution to the current challenges in managing large amounts of data, database sampling from the operational data available has proved to be a powerful technique. However, none of the existing sampling approaches consider the dependencies between the data in a relational database. In this paper, we propose a novel approach towards constructing a realistic testing environment, by analyzing the distribution of data in the original database along these dependencies before sampling, so that the sample database is representative to the original database.