Summary
In the last few years, Apache Spark has become the de facto standard framework for big data systems in both industry and academic projects. Spark is used to execute compute- and data-intensive workflows in distinct areas such as biology and astronomy. Although Spark is an easy-to-install framework, it has more than one hundred parameters to be set, besides the domain-specific parameters of each workflow. Thus, to execute Spark-based workflows efficiently, the user has to fine-tune a myriad of Spark and workflow parameters (e.g., partitioning strategy, the average size of a DNA sequence, etc.). This configuration task cannot be performed manually in a trial-and-error manner, since it is tedious and error-prone. This article proposes an approach that focuses on generating interpretable predictive machine learning models (i.e., decision trees) and then extracting useful rules (i.e., patterns) from these models, which nonexpert users can apply to configure parameters of future executions of the workflow and of Spark. In the experiments presented in this article, the proposed parameter configuration approach led to better performance in processing Spark workflows. Finally, the approach introduced here reduced the number of parameters to be configured by identifying, in the predictive model, the domain-specific parameters most relevant to workflow performance.
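To illustrate the general idea of learning interpretable rules from past executions, the minimal sketch below trains a shallow decision tree on synthetic execution logs and prints human-readable rules. It is only an assumption-laden example using scikit-learn; the feature names (shuffle_partitions, executor_memory_gb, avg_seq_size_kb), the data, and the performance label are illustrative and are not the article's actual features or results.

```python
# Hypothetical sketch: learn configuration rules from past Spark workflow runs.
# Feature values and names are illustrative, not taken from the article.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [shuffle partitions, executor memory (GB), avg DNA sequence size (KB)]
X = [
    [200, 4, 64], [200, 8, 64], [800, 4, 256],
    [800, 8, 256], [400, 8, 128], [400, 4, 128],
]
# Label: 1 = execution met the performance target, 0 = it did not.
y = [0, 1, 0, 1, 1, 0]

# A shallow tree keeps the predictive model interpretable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Extract human-readable rules that a nonexpert user could apply
# when configuring future executions of the workflow and Spark.
rules = export_text(
    tree,
    feature_names=["shuffle_partitions", "executor_memory_gb", "avg_seq_size_kb"],
)
print(rules)
```

In practice, the printed rules take the form of threshold conditions on the configuration parameters (for instance, bounds on executor memory or partition counts) that can be read directly and reused as configuration guidelines.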