The Dataset Scaling Problem (DSP) defined in previous work states:
Given an empirical set of relational tables D and a scale factor s, generate a database state D̃ that is similar to D but s times its size. A DSP solution is useful for application development (s < 1), scalability testing (s > 1) and anonymization (s = 1). Current solutions assume all table sizes scale by the same ratio s.
However, a real database tends to have tables that grow at different rates. This paper therefore considers non-uniform scaling (nuDSP), a DSP generalization where, instead of a single scale factor s, tables can scale by different factors.
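To make the difference between DSP and nuDSP concrete, the following minimal sketch computes target table sizes under a single scale factor and under per-table factors. The table names, row counts, and factors are hypothetical, and the helper functions are illustrative only; they are not part of any published solution.

```python
def target_sizes(table_sizes, scale_factors):
    """Return the target row count of each table given per-table scale factors."""
    missing = set(table_sizes) - set(scale_factors)
    if missing:
        raise KeyError(f"no scale factor for tables: {sorted(missing)}")
    return {name: round(n * scale_factors[name]) for name, n in table_sizes.items()}


def uniform_factors(table_sizes, s):
    """DSP as a special case of nuDSP: every table gets the same factor s."""
    return {name: s for name in table_sizes}


# Hypothetical empirical database: table name -> number of tuples.
sizes = {"users": 1000, "posts": 5000, "comments": 20000}

# Uniform scaling (DSP): all tables double.
uniform = target_sizes(sizes, uniform_factors(sizes, 2.0))

# Non-uniform scaling (nuDSP): each table grows at its own (assumed) rate.
non_uniform = target_sizes(sizes, {"users": 1.5, "posts": 2.0, "comments": 4.0})
```

This only fixes how many tuples each table should have; the hard part of nuDSP, keeping the scaled database similar to the original, is not addressed by this sketch.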
Dscaler is the first solution for nuDSP. It follows previous work in achieving similarity by reproducing correlation among the primary and foreign keys. However, it introduces the concept of a correlation database that captures fine-grained, per-tuple correlation.
Experiments with well-known real and synthetic datasets D show that Dscaler produces D̃ with greater similarity to D than state-of-the-art techniques. Here, similarity is measured by number of tuples, frequency distribution of foreign key references, and multi-join aggregate queries.
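One of the similarity measures named above, the frequency distribution of foreign key references, can be sketched as follows. The sketch counts, for each primary key, how many child tuples reference it, and then tallies how many keys have each reference count. The normalized L1 distance used to compare two distributions is an assumption made here for illustration; the abstract does not say which distance is used. All data and function names are hypothetical.

```python
from collections import Counter


def fk_reference_distribution(parent_keys, child_fk_values):
    """Map each reference count k to the number of parent keys referenced k times."""
    refs = Counter(child_fk_values)  # parent key -> number of referencing child tuples
    return Counter(refs[k] for k in parent_keys)  # unreferenced keys count as 0


def normalized(dist):
    """Convert counts to fractions so databases of different sizes are comparable."""
    total = sum(dist.values())
    if total == 0:
        return {}
    return {degree: count / total for degree, count in dist.items()}


def l1_distance(dist_a, dist_b):
    """Assumed comparison metric: L1 distance between normalized distributions."""
    pa, pb = normalized(dist_a), normalized(dist_b)
    return sum(abs(pa.get(d, 0.0) - pb.get(d, 0.0)) for d in set(pa) | set(pb))


# Hypothetical original database: 4 parent keys, 6 child tuples.
original = fk_reference_distribution([1, 2, 3, 4], [1, 1, 2, 4, 4, 4])

# Hypothetical scaled database: twice the size, same reference pattern.
scaled = fk_reference_distribution(
    range(1, 9), [1, 1, 2, 4, 4, 4, 5, 5, 6, 8, 8, 8]
)

distance = l1_distance(original, scaled)
```

A distance of zero here means the scaled database reproduces the shape of the reference distribution exactly, even though it holds twice as many tuples.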