Apache Spark has become one of the most popular engines for big data processing. Spark provides a platform-independent, high-abstraction programming paradigm for large-scale data processing by leveraging the Java framework. Although Java provides software portability across various machines, it also limits the performance of distributed environments such as Spark. While it may be unrealistic to rewrite platforms like Spark in a faster language, a more viable approach to mitigating this performance penalty is to accelerate the computations while still working within the Java-based framework. This paper demonstrates the feasibility of incorporating Field-Programmable Gate Array (FPGA) acceleration into Spark and presents the performance benefits and bottlenecks of our FPGA-accelerated Spark environment using a MapReduce implementation of the k-means clustering algorithm. The results show that acceleration is possible even when using a hardware platform that is not well optimized for performance. An important feature of our approach is that the use of FPGAs is completely transparent to the user: FPGA functionality is exposed through library functions, which is a common way for users to access functions provided by Spark. Power users can further develop other computations using high-level synthesis.
KEYWORDS
Apache Spark, big data, FPGA, high-level synthesis, Java, MapReduce
INTRODUCTION
Apache Spark 1 is one of the mainstream platforms for large-scale data computation in distributed computing. Built on a Java framework, Spark aims to create a platform-independent, high-abstraction programming paradigm, making large-scale application development easier while handling scalability-related issues, such as task scheduling and data distribution. Although Spark's framework provides high portability and ease of use for application developers, using Java creates enormous performance bottlenecks, especially for compute-intensive applications. To improve Spark performance, it is necessary to mitigate the handicap of using Java by introducing the ability to do high-performance computations while maintaining the ease of application development provided by Java. In addition, given the already widespread adoption of Spark, it is important to preserve compatibility with the current Spark programming framework.
Field-Programmable Gate Arrays (FPGAs) are integrated circuits with programmable logic blocks and interconnects that can be configured to implement any digital circuit. Due to their reprogrammability, FPGAs can be used to implement a variety of different applications in a computing environment. In addition to providing programmable blocks to build basic hardware logic functions, FPGAs also contain embedded blocks of memory and Digital Signal Processing (DSP) units. The DSP units can be used to efficiently implement many arithmetic operations, such as integer and floating-point multiplications. 2 Developers can use FPGAs to implement custom hardware circuits to accelerate algorithms with dedicated computing circuits that typically leverage various forms o...