We review recent literature that proposes to adapt ideas from classical model-based optimal design of experiments to problems of data selection from large datasets. Special attention is given to bias reduction and to protection against confounders. Some new results are presented. Theoretical and computational comparisons are made.
KEYWORDS: confounders, large datasets, model bias, optimal experimental design
INTRODUCTION

For the analysis of big datasets, statistical methods have been developed which use the full available dataset. For example, new methodologies developed in the context of Big Data and focussed on a 'divide-and-recombine' approach are summarised in Wang et al.19 Other major methods address the scalability of Big Data through Bayesian inference based on a Consensus Monte Carlo algorithm13 and sparsity assumptions.16 In contrast, other authors argue for the advantages of inference statements based on a well-chosen subset of the large dataset.

Big datasets are characterised by a few key features. While data in scientific studies can usually be collected via active or passive observation, Big Data is often collected in a passive way; rarely is its collection the result of a designed process. This generates sources of bias which either we do not know at all or are too costly to control, but which nevertheless affect the overall distribution of the observed variables.3,11 Many authors in Ref. 15 argue that the analysis of big datasets is affected by issues of bias and confounding, selection bias and other sampling problems (see, for example, Sharpes14 for electronic health records). Often the causal effect of interest can only be measured on average, and great care has to be taken about the background population: for example, it is possible to consider and analyse every message on Twitter and use it to draw conclusions about public opinion, but it is known that Twitter users are not representative of the whole population.

The analysis of the full dataset might be prohibitive because of computational and time constraints. Indeed, in some cases, the analysis of the full dataset might also be inadvisable.4,6 To recall just one example, the sample proportion from a self-reported big dataset of 2,300,000 units can have the same mean squared error as the sample proportion from a suitable simple random sample (SRS) of size 400; a Law of Large Populations has been defined in order to qualify this (see Meng9 and the sketch below).

Recently, some researchers have argued for the usefulness of methods and ideas from design of experiments (DoE), and more specifically from model-based optimal experimental design, for the analysis of big datasets. They argue that special
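A brief sketch of the mechanism behind this example, following our recollection of the decomposition in Meng,9 may help fix ideas. Let $\bar{Y}_n$ denote the mean of the $n$ recorded units, $\bar{Y}_N$ the mean of the whole population of size $N$, $R$ the recording indicator, $\rho_{R,Y}$ the correlation between $R$ and the variable of interest $Y$, and $\sigma_Y$ the population standard deviation. Then
\[
\bar{Y}_n-\bar{Y}_N \;=\; \rho_{R,Y}\,\sqrt{\frac{N-n}{n}}\;\sigma_Y ,
\qquad
n_{\mathrm{eff}} \;\approx\; \frac{f}{1-f}\cdot\frac{1}{E\!\left[\rho_{R,Y}^{2}\right]},
\qquad f=\frac{n}{N},
\]
where $n_{\mathrm{eff}}$ is the size of the SRS whose sample proportion attains approximately the same mean squared error as the self-reported dataset. For purely illustrative values (ours, not taken from the review), a data defect correlation of the order of $5\times 10^{-3}$ combined with a sampling fraction $f=1\%$ already gives $n_{\mathrm{eff}}\approx 400$, the order of magnitude quoted in the example above.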