Background: The expansion of the Internet and its use for online activities such as e-commerce and social networking are producing large volumes of transactional data. This data facilitates the analysis and understanding of global trends and interesting patterns, which are used for decision-making. The analytics involved can expose sensitive information present in these datasets, which is a serious privacy threat. To address this threat, a few sequential heuristics have been used in the past, when data volumes were small enough for such heuristics to handle; at current data volumes they often incur high execution times. This scalability challenge motivates experimenting with Big Data approaches (e.g., the MapReduce framework). We combine the MapReduce framework with the adopted heuristics to address scalability while providing the much-needed privacy preservation and yielding efficient analytic results within bounded execution times.
Methods: MapReduce is a parallel programming framework [16] that lets us leverage widely distributed resources for Big Data analytics by using the resources of a large distributed system in parallel. Simplicity and high fault tolerance are the key features that make MapReduce a promising framework. We therefore propose a two-phase MapReduce version of the adopted heuristics. The MapReduce framework divides the whole dataset into 'n' data chunks, $D = d_1 \cup d_2 \cup d_3 \cup \dots \cup d_n$, and distributes them over 'n' computing nodes to achieve parallelization. The first phase of the MapReduce job runs on each data chunk to generate intermediate results, which are then sorted and merged in the second phase to generate the final sanitized dataset.
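The two-phase scheme described above can be illustrated with a minimal single-machine sketch in Python. It is not the authors' implementation: the function names (`split_into_chunks`, `phase_one`, `phase_two`, `sanitize`) are hypothetical, the chunks are processed in a plain loop rather than on separate nodes, and the hiding heuristic (drop the alphabetically first item of a sensitive itemset from every transaction that contains it) is only a stand-in assumption for the adopted heuristics, which this section does not spell out.

```python
from typing import FrozenSet, List, Tuple

# A transaction is (transaction id, set of items). Illustrative type only.
Transaction = Tuple[int, FrozenSet[str]]


def split_into_chunks(data: List[Transaction], n: int) -> List[List[Transaction]]:
    """Partition D into n disjoint chunks d_1..d_n (round-robin)."""
    return [data[i::n] for i in range(n)]


def phase_one(chunk: List[Transaction],
              sensitive: List[FrozenSet[str]]) -> List[Transaction]:
    """Phase 1, run once per chunk: emit (id, sanitized items) pairs.

    Stand-in heuristic (an assumption, not the paper's): if a transaction
    contains a whole sensitive itemset, remove that itemset's
    alphabetically first item from the transaction.
    """
    out = []
    for tid, items in chunk:
        for pattern in sensitive:
            if pattern <= items:
                items = items - {min(pattern)}
        out.append((tid, items))
    return out


def phase_two(intermediate: List[List[Transaction]]) -> List[Transaction]:
    """Phase 2: merge the per-chunk results and sort them by transaction id."""
    merged = [pair for part in intermediate for pair in part]
    return sorted(merged, key=lambda pair: pair[0])


def sanitize(data: List[Transaction],
             sensitive: List[FrozenSet[str]],
             n: int) -> List[Transaction]:
    """Run both phases; on a real cluster each chunk would go to its own node."""
    chunks = split_into_chunks(data, n)
    intermediate = [phase_one(chunk, sensitive) for chunk in chunks]
    return phase_two(intermediate)
```

Because this stand-in heuristic treats every transaction independently, the sanitized output does not depend on how many chunks the data is split into; heuristics that rely on global support counts would need those counts to be aggregated across chunks first.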
Results: We conducted three sets of experiments, each with five scenarios corresponding to different cluster sizes, i.e., n = 1, 2, 3, 4, 5, where 'n' is the number of computing nodes. We compared the approaches on real as well as synthetically generated large datasets. Across varying data sizes and numbers of computing nodes, the sanitization time required by the MapReduce-based algorithm for a dataset of a given size was much lower than that of the traditional sequential approach. Scalability can be improved further by using more computing nodes. Lastly, another set of experiments explores how sanitization time changes with the amount of sensitive content present in a dataset. We evaluated the effectiveness of the proposed approach in these scenarios with cluster sizes varying from 1 to 5 nodes, and the execution time of our approach remained much lower than that of traditional schemes. Further, no hiding failure, artifactual patterns ha... Sharma and Toshniwal J Big Data (2017) 4:4 DOI 10.1186/s40537-017-0064-9