Performance Prediction for Apache Spark Platform

Wang, Kewen; Khan, Mohammad Maifi Hasan

doi:10.1109/hpcc-css-icess.2015.246

Cited by 126 publications

(77 citation statements)

References 10 publications

“…The performance of the shared‐memory computation programs can be predicted with the Tanzil et al model, whereas in those programs with remote direct memory access, the Wasi‐ur‐Rahman et al model can be used. Apache Spark programs process the data using distributed memory abstraction, and their performance can be predicted by a model that executes a sample of data …”

Section: Results (mentioning)

confidence: 99%

Testing MapReduce programs: A systematic mapping study

Morán, Riva, Tuya. 2018. J Software Evolu Process

Summary. Context: MapReduce is a processing model used in Big Data to facilitate the analysis of large data under a distributed architecture. Objective: The aim of this study is to identify and categorize the state of the art of software testing in MapReduce applications, determining trends and gaps. Method: Systematic mapping study to discuss and classify, according to international standards, 54 relevant studies in relation to reasons for testing, types of testing, quality characteristics, test activities, tools, roles, processes, test levels, and research validations. Results: The principal reasons for testing MapReduce applications are performance issues, potential failures, issues related to the data, or to satisfy the agreements with efficient resources. The efforts are focused on performance and, to a lesser degree, on functionality. Performance testing is carried out through simulation and evaluation, whereas functional testing considers some program characteristics (such as specification and structure). Regardless of the type of testing, the majority of efforts are focused at the unit and integration test levels of the specific MapReduce functions without considering other parts of the technology stack. Conclusions: Researchers have both opportunities and challenges in performance and functional testing, and there is room to improve their research through the use of mature and standard validation methods.

“…Tuning up its performance is an important concern of the community and yet there is not much related work. In [30], the authors present, to the best of our knowledge, the only Apache Spark prediction model. Again sampling the application with a smaller data size is used to get statistics about the duration of the tasks and plugged into a formula that gives an estimation of the total run time for a different file size.…”

Section: Related Work (mentioning)

confidence: 99%

Using machine learning to optimize parallelism in big data applications

Hernández, Pérez, Gupta, et al. 2018. Future Generation Computer Systems

In-memory cluster computing platforms have gained momentum in recent years, due to their ability to analyse large amounts of data in parallel. These platforms are complex and difficult-to-manage environments. In addition, there is a lack of tools to better understand and optimize such platforms, which consequently form the backbone of big data infrastructure and technologies. This directly leads to underutilization of available resources and application failures in such environments. One of the key aspects that can address this problem is optimization of the task parallelism of applications in such environments. In this paper, we propose a machine learning based method that recommends optimal parameters for task parallelization in big data workloads. By monitoring and gathering metrics at system and application level, we are able to find statistical correlations that allow us to characterize and predict the effect of different parallelism settings on performance. These predictions are used to recommend an optimal configuration to users before launching their workloads in the cluster, avoiding possible failures, performance degradation and wastage of resources. We evaluate our method with a benchmark of 15 Spark applications on the Grid5000 testbed. We observe up to a 51% gain on performance when using the recommended parallelism settings. The model is also interpretable and can give insights to the user into how different metrics and parameters affect the performance.
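The citation statement for this entry describes the cited prediction model only in outline: run the application on a smaller input, collect task-duration statistics, and plug them into a formula that estimates total run time at a different input size. The Python sketch below illustrates that general idea and nothing more. It is not the formula from the cited paper. The names `StageSample` and `predict_runtime` are hypothetical, and the scaling rule rests on assumptions stated here: the number of tasks per stage grows linearly with input size, the mean task duration stays the same, and tasks execute in waves across a fixed number of parallel slots.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class StageSample:
    """Task durations (seconds) observed for one stage during the sample run."""
    task_durations: List[float]


def predict_runtime(samples: List[StageSample],
                    sample_size_mb: float,
                    target_size_mb: float,
                    slots: int) -> float:
    """Estimate total run time (seconds) for a larger input.

    Assumptions (illustrative, not taken from the cited paper):
    - tasks per stage scale linearly with input size,
    - mean task duration is unchanged at the larger size,
    - stages run one after another, each in waves of `slots` tasks.
    """
    scale = target_size_mb / sample_size_mb
    total = 0.0
    for stage in samples:
        n_tasks = math.ceil(len(stage.task_durations) * scale)
        waves = math.ceil(n_tasks / slots)
        total += waves * mean(stage.task_durations)
    return total


# Hypothetical sample run: two stages measured on a 100 MB input.
sample_run = [StageSample([2.0, 4.0]), StageSample([1.0])]
estimate = predict_runtime(sample_run, sample_size_mb=100,
                           target_size_mb=400, slots=4)
print(estimate)  # prints 7.0
```

With these made-up numbers, the first stage grows from 2 to 8 tasks (two waves at a mean of 3 s) and the second from 1 to 4 tasks (one wave at 1 s), giving 7 s. A real model would also need to account for effects this sketch ignores, such as shuffle cost, skew, and scheduling overhead.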

“…Other works compare Apache Spark with other frameworks such as MapReduce [72], study the performance of Apache Spark for specific scenarios such as scale-up configuration [10], analyze the performance of Spark's programming model for large-scale data analytics [78] and identify the performance bottlenecks in Apache Spark [66] [11]. In addition, as Apache Spark offers language-integrated APIs, there are some efforts to provide the APIs in other languages.…”

Section: Related Research (mentioning)

confidence: 99%

Big data analytics on Apache Spark

Salloum, Dautov, Chen, et al. 2016. Int J Data Sci Anal

Apache Spark has emerged as the de facto framework for big data analytics with its advanced in-memory programming model and upper-level libraries for scalable machine learning, graph analysis, streaming and structured data processing. It is a general-purpose cluster computing framework with language-integrated APIs in Scala, Java, Python and R. As a rapidly evolving open source project, with an increasing number of contributors from both academia and industry, it is difficult for researchers to comprehend the full body of development and research behind Apache Spark, especially those who are beginners in this area. In this paper, we present a technical review on big data analytics using Apache Spark. This review focuses on the key components, abstractions and features of Apache Spark. More specifically, it shows what Apache Spark has for designing and implementing big data algorithms and pipelines for machine learning, graph analysis and stream processing. In addition, we highlight some research and development directions on Apache Spark for big data analytics.
