Parallel fractional hot-deck imputation (P-FHDI [1]) is a general-purpose, assumption-free tool for handling item nonresponse in big incomplete data by combining the theory of FHDI with parallel computing. FHDI cures multivariate missing data by filling each missing unit with multiple observed values (hence, hot deck) without resorting to distributional assumptions. P-FHDI can tackle big incomplete data with millions of instances (big-n) or 10,000 variables (big-p). However, ultra incomplete data (i.e., concurrently big-n and big-p), with both tremendous instances and high dimensionality, has posed problems for P-FHDI owing to excessive memory requirements and execution time. To tackle these challenges, we propose the ultra data-oriented P-FHDI (named UP-FHDI), capable of curing ultra incomplete data. In addition to the parallel jackknife method, this paper enables computationally efficient, ultra data-oriented variance estimation using parallel linearization techniques. Results confirm that UP-FHDI can tackle an ultra dataset with one million instances and 10,000 variables. This paper illustrates the special parallel algorithms of UP-FHDI and confirms its positive impact on subsequent deep learning performance.
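To make the hot-deck idea concrete, the Python sketch below fills each missing cell with several observed donor values, each carrying a fractional weight. It is a minimal illustration under stated assumptions, not the FHDI/UP-FHDI implementation: the function names, equal fractional weights, and fully random donor selection are choices made here for brevity, whereas the actual method selects donors within imputation cells derived from the joint missingness pattern.

```python
# Minimal, illustrative sketch of fractional hot-deck imputation for a
# single variable. Names and the donor-selection rule are assumptions
# for illustration only, not the FHDI/UP-FHDI algorithm.
import numpy as np

def fractional_hot_deck(y, n_donors=3, rng=None):
    """Fill each missing entry of y with n_donors observed values,
    each carrying an equal fractional weight 1/n_donors."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    observed = y[~np.isnan(y)]
    imputed = []  # list of (missing index, donor values, fractional weights)
    for i in np.flatnonzero(np.isnan(y)):
        donors = rng.choice(observed, size=n_donors, replace=False)
        weights = np.full(n_donors, 1.0 / n_donors)
        imputed.append((i, donors, weights))
    return imputed

def fhdi_mean(y, imputed):
    """Point estimate of the mean: observed values plus the
    fractionally weighted donors for each missing cell."""
    y = np.asarray(y, dtype=float)
    total = np.nansum(y)
    for _, donors, weights in imputed:
        total += np.dot(weights, donors)
    return total / y.size

y = [4.2, np.nan, 5.1, 3.9, np.nan, 4.7]
cells = fractional_hot_deck(y, n_donors=2, rng=0)
print(fhdi_mean(y, cells))
```

Because every missing cell keeps multiple donors with fractional weights, a single completed dataset supports weighted point estimation directly, which is what makes replication-based variance estimation (and its parallelization) natural for this family of methods.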
This paper deals with making inferences about the parameters of a two-level model matching the design hierarchy of a two-stage sample. In a pioneering paper, Scott and Smith (Journal of the American Statistical Association, 1969, 64, 830-840) proposed a Bayesian model-based, or prediction, approach to estimating a finite population mean under two-stage cluster sampling. We provide a brief account of their pioneering work. We review two methods for the analysis of two-level models based on matching two-stage samples: pseudo maximum likelihood and pseudo composite likelihood, both taking account of design weights. We then propose a new method for the analysis of two-level models based on a normal approximation to the estimated cluster effects, again taking account of design weights. This method does not require cluster sizes to be constant or unrelated to cluster effects. We evaluate the relative performance of the three methods in a simulation study. Finally, we apply the methods to real data from the 2011 Nepal Demographic and Health Survey (NDHS).
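For readers unfamiliar with this setup, a generic two-level (nested-error) model with design weights can be written as below. The notation is assumed here for illustration; the paper's own formulation may differ.

```latex
% Generic two-level (nested-error) model; notation assumed for illustration.
% Clusters i = 1, ..., m; units j = 1, ..., n_i within cluster i.
\[
  y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_i + e_{ij},
  \qquad u_i \sim N(0, \sigma_u^2), \qquad e_{ij} \sim N(0, \sigma_e^2).
\]
% Design-weighted pseudo log-likelihood, with cluster weights w_i and
% within-cluster weights w_{j|i} from the two-stage design:
\[
  \ell_w(\boldsymbol{\theta})
    = \sum_{i=1}^{m} w_i \log \int
      \Bigl[ \prod_{j=1}^{n_i}
        f\bigl(y_{ij} \mid u_i; \boldsymbol{\theta}\bigr)^{w_{j \mid i}}
      \Bigr]
      \phi(u_i; 0, \sigma_u^2)\, du_i .
\]
```

Weighting the within-cluster likelihood contributions by \(w_{j \mid i}\) and the cluster contributions by \(w_i\) is what protects the estimates against informative sampling at either stage; this is the general idea behind the pseudo-likelihood methods reviewed above.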
Machine learning (ML) advancements hinge upon data, the vital ingredient for training. Statistically curing missing data is called imputation, and many imputation theories and tools exist, but they often require difficult statistical and/or discipline-specific assumptions, and general tools capable of curing large data have been lacking. Fractional hot-deck imputation (FHDI) can cure data by filling nonresponses with observed values (thus, "hot deck") without resorting to assumptions. This review paper summarizes how FHDI has evolved into an ultra data-oriented parallel version (UP-FHDI). Here, "ultra" data have concurrently large instances (big-n) and high dimensionality (big-p). The evolution is made possible by specialized parallelism and a fast variance estimation technique. Validations with scientific and engineering data confirm that UP-FHDI can cure ultra data (p > 10,000 and n > 1M), and the cured datasets can improve the prediction accuracy of subsequent ML. The evolved FHDI will help promote reliable ML with "cured" big data.
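As a minimal illustration of the jackknife-type variance estimation mentioned in these abstracts, the sketch below computes a delete-one jackknife variance for a generic estimator. It deliberately omits the fractional weighting and the parallel decomposition that UP-FHDI uses; the function name and example data are assumptions for illustration.

```python
# Minimal delete-one jackknife variance sketch for a generic estimator.
# The real UP-FHDI variance estimators (parallel jackknife and parallel
# linearization) operate on fractionally weighted data, not shown here.
import numpy as np

def jackknife_variance(data, estimator):
    """Delete-one jackknife variance of estimator(data)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # Recompute the estimator n times, each time leaving one unit out.
    replicates = np.array([estimator(np.delete(data, k)) for k in range(n)])
    return (n - 1) / n * np.sum((replicates - replicates.mean()) ** 2)

sample = np.array([4.2, 5.1, 3.9, 4.7, 4.4])
print(jackknife_variance(sample, np.mean))
```

Because the delete-one scheme recomputes the estimator n times, its cost grows with n; this is the bottleneck that motivates the faster linearization-based variance estimation described for UP-FHDI.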