Comparing Distributed Online Stream Processing Systems Considering Fault Tolerance Issues

Gradvohl, André Leon Sampaio; Senger, Hermes; Arantes, Luciana; Sens, Pierre

doi:10.4304/jetwi.6.2.174-179

Cited by 15 publications

(9 citation statements)

References 22 publications

“…Some distributed checkpointing schemes such as Meteor Shower [33] have each operator checkpointing independently. This type of a scheme imposes additional overhead and needs more effort to maintain a consistent global state compared to system-wide checkpoint [14,3]. For instance, this approach requires saving the message buffers at each operator to recover from failures whereas a system-wide checkpoint saves message buffers only at the sources.…”

Section: Utilization Of a Stream Processing System (mentioning)

confidence: 99%

A utilization model for optimization of checkpoint intervals in distributed stream processing systems

Jayasekara; Harwood; Karunasekera (2020). Future Generation Computer Systems

State-of-the-art distributed stream processing systems such as Apache Flink and Storm have recently included checkpointing to provide fault-tolerance for stateful applications. This is a necessary eventuality as these systems head into the Exascale regime, and is evidently more efficient than replication as state size grows. However, current systems use a nominal value for the checkpoint interval, indicative of assuming roughly 1 failure every 19 days, that does not take into account the salient aspects of the checkpoint process, nor the system scale, which can readily lead to inefficient system operation. To address this shortcoming, we provide a rigorous derivation of utilization (the fraction of total time available for the system to do useful work) that incorporates checkpoint interval, failure rate, checkpoint cost, failure detection and restart cost, depth of the system topology and message delay. Our model yields an elegant expression for utilization and provides an optimal checkpoint interval given these parameters, interestingly showing it to be dependent only on checkpoint cost and failure rate. We confirm the accuracy and efficacy of our model through experiments with Apache Flink, where we obtain improvements in system utilization for every case, especially as the system size increases. Our model provides a solid theoretical basis for the analysis and optimization of more elaborate checkpointing approaches.
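The abstract above reports that the optimal checkpoint interval depends only on checkpoint cost and failure rate, but the listing does not reproduce the paper's expression. As a rough illustration of that kind of dependence, the sketch below uses Young's classic first-order approximation, which is an assumption swapped in here and not the authors' utilization model. The function names and parameters (`young_interval`, `first_order_overhead`, `checkpoint_cost`, `mtbf`) are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation: T = sqrt(2 * C * MTBF).

    checkpoint_cost: time to take one checkpoint (same unit as mtbf).
    mtbf: mean time between failures, i.e. 1 / failure rate.
    NOTE: this is a textbook approximation, not the model from the paper.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_overhead(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost: checkpointing plus expected rework.

    C / T is the share of time spent writing checkpoints; T / (2 * MTBF) is the
    expected share of work redone after a failure (half an interval on average).
    Restart cost, detection delay and topology depth are deliberately ignored.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    # Illustrative inputs only: a 10 s checkpoint and one failure every 19 days.
    cost_s = 10.0
    mtbf_s = 19 * 24 * 3600.0
    t_opt = young_interval(cost_s, mtbf_s)
    print(f"interval: {t_opt:.1f} s, overhead: {first_order_overhead(t_opt, cost_s, mtbf_s):.5f}")
```

Under this simplified cost function, intervals shorter or longer than the returned value both increase the estimated overhead, which is the trade-off the abstract alludes to.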

“…However, using all components are not mandatory, and an actual system may have only some of these features. The communication between components often uses TCP/IP protocols (Gradvohl et al, 2014).…”

Section: Fundamental Concepts (mentioning)

confidence: 99%

“…These streams are potentially unbounded data transmitted at high volume and high velocities. Some of them require real-time processing and analysis, such as disaster management, network attack and anomaly detection, financial market, trend analysis, social media, web analytics, Internet of Things (IoT), operational infrastructure monitoring, and online advertising (de Assunção et al, 2018, Gradvohl et al, 2014.…”

Section: Introduction (mentioning)

confidence: 99%

Evaluating the impact of a coordinated checkpointing in distributed data streams processing systems using discrete event simulation

Moraes; Gradvohl (2020). RBCA

Data Streams Processing systems process continuous flows of data under Quality of Service requirements. Data streams often contain critical information which requires real-time processing. To guarantee systems' dependability and avoid information loss, one must use a fault-tolerance strategy. However, there are several strategies available, and the proper evaluation of which mechanism is better for each system architecture is challenging, especially in large-scale distributed systems. In this paper, we propose a discrete simulation model for investigating the impacts that the Coordinated Checkpoint fault tolerance strategy imposes on Data Stream Processing Systems. Results show that this strategy critically affects stream processing in failure-prone situations due to an increase in latency of up to 120% and information loss reaching 95% of the processing window in the worst case.

“…In the next section, we present the main Big Data technologies. We also present, in the same section, a comparative study of these frameworks (Chintapalli et al, 2016;Gradvohl et al, 2014;Zhang et al, 2017).…”

Section: Big Data Analysis (mentioning)

confidence: 99%

A Study on Big Data Frameworks and Machine Learning Tool Kits

Sassi; Anter (2019)

Proceedings of the International Conferences Big Data Analytics, Data Mining and Computational Intelligence 2019; And Theory An

Big Data is an extremely large amount of structured and unstructured data, gathered from a wide range of sources, which often requires fast processing and real-time analysis. In this new context, the performance of traditional techniques is limited. To handle these bulky quantities of data, new technologies have emerged, called Big Data technologies. In fact, the characteristics of Big Data have made the exploration of these data a painful task. This process is called Big Data Analytics. One of the important challenges of Big Data is to search for new technologies or to improve and extend the existing platforms, infrastructures and standard techniques to manage Big Data. The Hadoop / MapReduce paradigm and the Spark framework are among the most prominent solutions for large-scale parallel distributed data processing, alongside Machine Learning techniques, in particular Deep Learning, for performing powerful statistical and predictive analysis. In this paper, we first give an overview, a classification and a comparison of the main Big Data technologies. Then, we focus in particular on Machine Learning platforms and libraries, especially those for Deep Learning. The results show that Spark is a general-purpose computation engine thanks to its very generalized solutions.
