This paper describes SEMROD, a sensitive-data-aware MapReduce (MR) framework for hybrid clouds. SEMROD steers data and computation through public and private machines in such a way that no knowledge about sensitive data is leaked to public machines. For this purpose, SEMROD keeps track of intermediate keys (generated during MR execution) that become sensitive, and it makes dynamic task scheduling decisions based on them. SEMROD guarantees that adversaries (viz. public machines) cannot gain any "additional" information about sensitive data from either the data stored on public machines or the communication between public and private machines during job execution. SEMROD extends naturally from a single MR job to multiphase MR jobs that result, for instance, from compiling Hive queries into MR jobs. Using SEMROD, computation that may involve sensitive data can exploit public machines, bringing significant performance benefits; such computation would otherwise be restricted to private clouds. Our experiments clearly demonstrate the performance advantages of using SEMROD compared with other secure alternatives, even when the percentage of sensitive data is as high as 50%.
Abstract. Data declustering is used to minimize query response times in data-intensive applications. The technique parallelizes query retrieval by distributing the data among several disks, and it is useful in applications such as geographic information systems that access huge amounts of data. Declustering with replication extends declustering by allowing replicas of data items in the system. Many replicated declustering schemes have been proposed, and most of them generate two or more copies of all data items. However, some applications have very large data sizes, and keeping even two copies of all data items may not be feasible. In such systems selective replication is a necessity. Furthermore, existing replication schemes are not designed to utilize query distribution information when such information is available. In this study we propose a replicated declustering scheme that decides both which data items to replicate and how to assign all data items to disks when replication capacity is limited. We use the available query information to decide the replication and partitioning of the data, with the aim of optimizing aggregate parallel response time. We propose and implement a Fiduccia-Mattheyses-like iterative improvement algorithm to obtain a two-way replicated declustering, and we use this algorithm in a recursive framework to generate a multi-way replicated declustering. Experiments conducted with arbitrary queries on real datasets show that, especially for low replication constraints, the proposed scheme performs better than existing replicated declustering schemes.
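The objective named in the abstract above, aggregate parallel response time, can be illustrated with a minimal sketch. It is not the authors' algorithm: it covers only the simpler case without replication, and it assumes the common convention that a query's parallel response time is the largest number of its items stored on any single disk. The function names, the `disk_of` mapping, and the sample queries with their frequencies are all hypothetical.

```python
from collections import Counter
from math import ceil


def response_time(query, disk_of):
    """Parallel response time of one query without replication (assumed
    definition): the largest number of requested items on a single disk."""
    loads = Counter(disk_of[item] for item in query)
    return max(loads.values()) if loads else 0


def aggregate_response_time(queries, disk_of):
    """Sum of per-query response times, weighted by query frequency."""
    return sum(freq * response_time(items, disk_of) for items, freq in queries)


def ideal_response_time(query, num_disks):
    """Lower bound for a query: its items spread evenly over all disks."""
    return ceil(len(query) / num_disks)


# Hypothetical workload: (items requested, frequency) pairs over two disks.
queries = [(("a", "b"), 3), (("a", "c", "d"), 1)]

# Two hypothetical item-to-disk assignments.
clustered = {"a": 0, "b": 0, "c": 1, "d": 1}
declustered = {"a": 0, "b": 1, "c": 1, "d": 0}

print(aggregate_response_time(queries, clustered))    # 8
print(aggregate_response_time(queries, declustered))  # 5
```

The second assignment separates the frequently co-requested items `a` and `b`, which lowers the aggregate cost. That is the kind of query information a query-aware scheme can exploit. With replication, each query must additionally choose which copy of each item to read, which this sketch does not model.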