Abstract-Big Data analytics has recently gained increasing popularity as a tool to process large amounts of data on-demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph (DAG) patterns. Making the most out of these frameworks is challenging because efficient execution strongly relies on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most studies benchmark the platforms against Hadoop as a baseline, a rather unfair comparison considering their fundamentally different design principles. This paper aims to bring some justice in this respect by directly evaluating the performance of Spark and Flink against each other. Our goal is to identify and explain the impact of the different architectural choices and parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operators' execution plan with the resource usage, and we use it to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that neither of the two frameworks outperforms the other for all data types, sizes, and job patterns. We provide a fine-grained characterization of the cases in which each framework is superior, and we highlight how this performance correlates with operators, with resource usage, and with the specifics of the internal framework design.
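To make the notion of a multi-step DAG pipeline and its parameter configuration concrete, the following minimal Spark word-count sketch in Scala shows how each transformation adds a node to the operator graph and how configuration keys tune execution. It is only an illustration under assumed settings: the input and output paths, the executor memory, and the parallelism value are hypothetical and do not correspond to the configurations evaluated in the paper.

  import org.apache.spark.sql.SparkSession

  object WordCountPipeline {
    def main(args: Array[String]): Unit = {
      // Hypothetical parameter configuration; effective values depend on
      // the cluster and workload, which is precisely what the paper dissects.
      val spark = SparkSession.builder()
        .appName("WordCountPipeline")
        .config("spark.executor.memory", "4g")       // assumed value
        .config("spark.default.parallelism", "200")  // assumed value
        .getOrCreate()
      val sc = spark.sparkContext

      // Each transformation below adds a node to the directed acyclic graph;
      // nothing executes until the terminal action (saveAsTextFile) runs.
      sc.textFile("hdfs:///input/corpus")            // hypothetical path
        .flatMap(_.split("\\s+"))                    // map stage
        .map(word => (word, 1))
        .reduceByKey(_ + _)                          // triggers a shuffle
        .saveAsTextFile("hdfs:///output/wordcounts") // hypothetical path

      spark.stop()
    }
  }

An analogous Flink job would build a similar operator graph through its DataSet or DataStream API, but, as the abstract notes, the two engines execute such graphs under different architectural choices (for instance, Flink's pipelined dataflow execution versus Spark's stage-based scheduling), which is what drives the performance differences studied here.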
The emergence of cloud computing brought the opportunity to use large-scale computational infrastructures for a broad spectrum of scientific applications. As more and more cloud providers and technologies appear, scientists face an increasingly difficult problem: evaluating the various offerings, such as public and private clouds, and deciding which model best fits their applications' needs. In this paper, we evaluate the performance of a public and a private cloud platform for scientific computing workloads. We compare the Azure and Nimbus clouds, considering all the primary needs of scientific applications (computation power, storage, data transfers, and cost). The evaluation uses both synthetic benchmarks and a real-life application. Our results show that Nimbus exhibits less variability and offers better support for data-intensive applications, while Azure deploys faster and has a lower cost.