2018
DOI: 10.1007/978-3-319-77935-5_22
On the Timed Analysis of Big-Data Applications

Abstract: Apache Spark is one of the best-known frameworks for executing big-data batch applications over a cluster of (virtual) machines. Defining the cluster (i.e., the number of machines and CPUs) to attain guarantees on the execution times (deadlines) of the application is indeed a trade-off between the cost of the infrastructure and the time needed to execute the application. Sizing the computational resources, in order to prevent cost overruns, can benefit from the use of formal models as a means to capture the ex…
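To make the cost-versus-deadline trade-off described in the abstract concrete, the following is a minimal back-of-envelope sketch in Python, not the paper's formal (temporal-logic) model: it assumes a Spark job described only by per-stage task counts and average task durations, assumes tasks run in waves over a fixed number of cores, and checks the resulting estimate against a deadline. All stage names, counts, and times below are illustrative assumptions.

# Back-of-envelope deadline check for a Spark-like job (illustrative only,
# not the formal model analysed in the paper).
import math

def stage_time(num_tasks: int, avg_task_time: float, cores: int) -> float:
    # Tasks in a stage run in waves of at most `cores` tasks at a time.
    waves = math.ceil(num_tasks / cores)
    return waves * avg_task_time

def job_time(stages, cores: int) -> float:
    # Assume stages execute sequentially along the DAG's critical path.
    return sum(stage_time(n, t, cores) for n, t in stages)

# Hypothetical job: (number of tasks, average task time in seconds) per stage.
stages = [(200, 1.5), (64, 4.0), (16, 2.0)]
deadline = 120.0  # seconds

for cores in (8, 16, 32, 64):
    est = job_time(stages, cores)
    verdict = "meets" if est <= deadline else "misses"
    print(f"{cores:>3} cores -> estimated {est:6.1f}s ({verdict} the {deadline:.0f}s deadline)")

Running the sketch with these made-up numbers shows how adding cores shortens the estimated execution time, which is the kind of sizing question the paper addresses with a formal, verifiable model rather than a rough estimate.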

Cited by 5 publications (3 citation statements); references 15 publications.
“…The work [16] presents a formal model for Spark applications based on temporal logic. The model takes into account the DAG that forms the program, information about the execution environment, such as the number of CPU cores available, the number of tasks of the program and the average execution time of the tasks.…”
Section: Related Work (mentioning)
confidence: 99%
“…Based on this specification, necessary and sufficient conditions are extracted to verify whether the outputs of aggregations in a Spark program are deterministic. The work [37] presents a formal model for Spark applications based on temporal logic. The model considers the DAG that forms the program, information about the execution environment, such as the number of CPU cores available, the number of program tasks, and the average execution time of the tasks.…”
mentioning
confidence: 99%
“…Then, the model is used to check time constraints and make predictions about the program's execution time. Both works ([11] and [37]) aim to evaluate Spark programs for specific properties. The abstraction level of our model is higher than that of the work of Marconi et al., so our model is not suited, as it is, to evaluate cluster behavior.…”
mentioning
confidence: 99%