The Semantic Web contains many billions of statements, which are released using the resource description framework (RDF) data model. To better handle these large amounts of data, high-performance RDF applications must apply a compression technique. Unfortunately, because of the large input size, even this compression is challenging. In this paper, we propose a set of distributed MapReduce algorithms to efficiently compress and decompress a large amount of RDF data. Our approach uses a dictionary encoding technique that maintains the structure of the data. We highlight the problems of distributed data compression and describe the solutions that we propose. We have implemented a prototype using the Hadoop framework, and we evaluate its performance. We show that our approach is able to efficiently compress a large amount of data and that it scales linearly on both input size and number of nodes.

To make dictionary encoding a feasible technique on a very large input, a distributed implementation is required. To the best of our knowledge, no distributed approach exists to solve this problem.

In this paper, we propose a technique to compress and decompress RDF statements using the MapReduce programming model [6]. Our approach uses a dictionary encoding technique that maintains the original structure of the data. This technique can be used by all RDF applications that need to efficiently process a large amount of data, such as RDF storage engines, network analysis tools, and reasoners.

Our compression technique was essential in our recent work on Semantic Web inference engines, as it allowed us to reason directly on the compressed statements with a consequent increase in performance. As a result, we were able to reason over tens of billions of statements [7,8], significantly advancing the current state of the art in the field.

The compression technique we present in this paper has the following features: (i) performance that scales linearly; (ii) the ability to build a very large dictionary of hundreds of millions of entries; and (iii) the ability to handle load-balancing issues with sampling and caching.

This paper is structured as follows. In Section 2, we discuss the conventional approach to dictionary encoding and highlight the problems that arise. Sections 3 and 4 describe how we have implemented the data compression and decompression in MapReduce. Section 5 evaluates our approach, and Section 6 describes related work. Finally, we conclude and discuss future work in Section 7.
DICTIONARY ENCODING

Dictionary encoding is often used because of its simplicity. In our case, dictionary encoding also has the additional advantage that the compressed data can still be manipulated by the application. Traditional techniques such as gzip or bzip2 hide the original data, so that reading it without decompression is impossible.

Algorithm 1 shows a sequential algorithm to compress and decompress RDF statements. The compression algorithm starts by initializing the dictionary table. The table has two columns, one tha...
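To make the role of the dictionary table concrete, the following is a minimal sketch of sequential dictionary encoding and decoding of RDF statements, written in Java (the language of our Hadoop-based prototype). It assumes a simple in-memory dictionary represented by one map per column; the class and method names (SequentialDictionaryEncoder, compress, decompress) are illustrative and do not correspond to the paper's actual implementation or to Algorithm 1 verbatim.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a sequential dictionary encoder for RDF terms.
// The two-column dictionary table is modeled as two in-memory maps
// (term -> numerical ID and numerical ID -> term).
public class SequentialDictionaryEncoder {

    private final Map<String, Long> termToId = new HashMap<>();
    private final Map<Long, String> idToTerm = new HashMap<>();
    private long nextId = 0;

    // Compress one statement (subject, predicate, object) into numerical IDs,
    // assigning a fresh ID the first time a term is encountered.
    public long[] compress(String subject, String predicate, String object) {
        return new long[] { encode(subject), encode(predicate), encode(object) };
    }

    // Decompress a statement back into its original terms via the reverse map.
    public String[] decompress(long[] ids) {
        return new String[] { idToTerm.get(ids[0]), idToTerm.get(ids[1]), idToTerm.get(ids[2]) };
    }

    private long encode(String term) {
        Long id = termToId.get(term);
        if (id == null) {
            id = nextId++;
            termToId.put(term, id);
            idToTerm.put(id, term);
        }
        return id;
    }

    public static void main(String[] args) {
        SequentialDictionaryEncoder enc = new SequentialDictionaryEncoder();
        long[] compressed = enc.compress(
            "<http://example.org/alice>",
            "<http://xmlns.com/foaf/0.1/knows>",
            "<http://example.org/bob>");
        // The statement keeps its triple structure, e.g. [0, 1, 2],
        // so it can still be manipulated without decompression.
        System.out.println(Arrays.toString(compressed));
        System.out.println(Arrays.toString(enc.decompress(compressed)));
    }
}

Note that such a sequential encoder must keep the entire dictionary in the memory of a single machine; with dictionaries of hundreds of millions of entries this quickly becomes infeasible, which is why Sections 3 and 4 turn to a distributed MapReduce implementation.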