The Dataset Scaling Problem (DSP) defined in previous work states: Given an empirical set of relational tables D and a scale factor s, generate a database state D' that is similar to D but s times its size. A DSP solution is useful for application development (s < 1), scalability testing (s > 1) and anonymization (s = 1). Current solutions assume all table sizes scale by the same ratio s.
However, a real database tends to have tables that grow at different rates. This paper therefore considers non-uniform scaling (nuDSP), a DSP generalization where, instead of a single scale factor s, tables can scale by different factors.
The solution proposed here is the first for nuDSP. It follows previous work in achieving similarity by reproducing correlation among the primary and foreign keys. However, it introduces the concept of a correlation database that captures fine-grained, per-tuple correlation.
Experiments with well-known real and synthetic datasets D show that the proposed solution produces a scaled database D' with greater similarity to D than state-of-the-art techniques. Here, similarity is measured by number of tuples, frequency distribution of foreign key references, and multi-join aggregate queries.
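
To make the first two similarity measures concrete, the following is a minimal sketch and not the method of this paper. It computes per-table target sizes under non-uniform scaling and compares the frequency distribution of foreign key references between an original and a scaled table. The function names, the toy `user`/`post` tables, and the choice of total variation distance as the comparison metric are all assumptions made for illustration.

```python
from collections import Counter


def target_sizes(sizes, factors):
    """Target tuple count per table when each table has its own scale factor."""
    return {table: round(sizes[table] * factors[table]) for table in sizes}


def fk_frequency_distribution(fk_values, parent_keys):
    """Fraction of parent tuples that are referenced exactly k times, per k."""
    refs = Counter(fk_values)  # parent key -> number of referencing tuples
    by_count = Counter(refs.get(pk, 0) for pk in parent_keys)
    n = len(parent_keys)
    return {k: c / n for k, c in by_count.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (assumed metric)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Hypothetical toy database: 4 users, 6 posts referencing users.
sizes = {"user": 4, "post": 6}
factors = {"user": 2.0, "post": 3.0}  # non-uniform: tables scale differently
targets = target_sizes(sizes, factors)

orig_users = [1, 2, 3, 4]
orig_post_fk = [1, 1, 1, 2, 2, 3]

# Hypothetical scaled output with 8 users and 18 posts.
scaled_users = list(range(1, 9))
scaled_post_fk = [1] * 5 + [2] * 5 + [3] * 3 + [4] * 3 + [5] + [6]

orig_dist = fk_frequency_distribution(orig_post_fk, orig_users)
scaled_dist = fk_frequency_distribution(scaled_post_fk, scaled_users)
distance = total_variation(orig_dist, scaled_dist)

print(targets)
print(distance)
```

A smaller distance indicates that the scaled table references its parent with a frequency profile closer to the original. Because the two tables scale by different factors, the average number of references per parent tuple necessarily changes, so the distance is generally nonzero under non-uniform scaling.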