Introduction
Motivation
Sampling-based approaches have been adopted to alleviate the burden of big data volume, not only when approximate results are as useful as exact ones [1][2][3][4][5], but also when the results from a small clean sample can be more accurate than those from the entire dirty data [6][7][8][9]. It is common practice to iteratively generate small random samples of a big data set to explore the statistical properties of the entire data and define cleaning rules [10][11][12][13][14][15][16][17][18][19]. This iterative process becomes impractical or impossible on small computing clusters due to the communication, I/O, and memory costs of cluster computing frameworks that implement a shared-nothing architecture [20][21][22]. While these distributed frameworks have not adapted well to the requirements of data exploration tasks, existing sequential techniques do not scale easily to big data [23]. In fact, there are plenty of data exploration and analysis libraries in common data science languages, e.g., R and Python [24][25]. To scale these libraries to big data on computing clusters, new distributed implementations are required to process distributed data. Even with distributed algorithms, the memory of the computing cluster may not be enough to hold the entire

Abstract
Data scientists need scalable methods to explore and clean big data before applying advanced data analysis and mining algorithms. In this paper, we propose the RSP-Explore method to enable data scientists to iteratively explore big data on small computing clusters. We address three main tasks: statistical estimation, error detection, and data cleaning. The Random Sample Partition (RSP) distributed data model is used to represent the data as a set of ready-to-use random sample data blocks (called RSP blocks) of the entire data. Block-level samples of RSP blocks are selected to understand the data, identify potential types of value errors, and get samples of clean data.
We provide a theoretical analysis of using RSP blocks for statistical estimation and demonstrate empirically the advantages of the RSP-Explore method. Experimental results on three real data sets show that the approximate results from RSP-Explore can rapidly converge toward the true values. Furthermore, cleaning a sample of RSP blocks is sufficient to estimate the statistical properties of the unknown clean data.
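To make the block-level sampling idea concrete, the following is a minimal single-machine sketch and not the RSP-Explore implementation itself. It assumes that a set of RSP-like blocks can be imitated by shuffling the records once and cutting them into equal-sized blocks, and that a statistic such as the mean is then estimated from a few randomly selected blocks. The function names `make_rsp_blocks` and `estimate_mean_from_blocks` are hypothetical and introduced here only for illustration.

```python
import random
import statistics


def make_rsp_blocks(values, num_blocks, seed=0):
    """Imitate an RSP on one machine (assumption: a single global shuffle).

    The records are shuffled once and cut into equal-sized blocks, so each
    block is a random sample of the whole data set. Records left over when
    len(values) is not divisible by num_blocks are dropped in this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    block_size = len(shuffled) // num_blocks
    return [
        shuffled[i * block_size:(i + 1) * block_size]
        for i in range(num_blocks)
    ]


def estimate_mean_from_blocks(blocks, num_selected, seed=0):
    """Estimate the mean of the entire data from a block-level sample.

    num_selected blocks are drawn without replacement and their block means
    are averaged. Because all blocks have the same size, this equals the
    mean of the pooled records in the selected blocks.
    """
    rng = random.Random(seed)
    selected = rng.sample(blocks, num_selected)
    return statistics.fmean(statistics.fmean(block) for block in selected)


if __name__ == "__main__":
    # Synthetic data for illustration only; not one of the paper's data sets.
    data_rng = random.Random(42)
    data = [data_rng.gauss(50.0, 10.0) for _ in range(100_000)]

    blocks = make_rsp_blocks(data, num_blocks=100)
    for g in (1, 5, 20):
        estimate = estimate_mean_from_blocks(blocks, num_selected=g)
        print(f"blocks used: {g:2d}  estimated mean: {estimate:.3f}")
    print(f"mean of the entire data: {statistics.fmean(data):.3f}")
```

In this sketch only the selected blocks are read when computing an estimate, which reflects the motivation for block-level sampling on small clusters: increasing the number of selected blocks trades extra I/O for a tighter estimate.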