2020 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata50022.2020.9378474
Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs

Abstract: Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered to. Typically, systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing Checkpoint and Rollback Recovery. However, this is an expensive operation which impacts negatively on the overall performance of the system and manual…

Cited by 10 publications (7 citation statements). References 15 publications.
“…A number of approaches have been proposed that optimize the fault tolerance configuration parameters by finding an optimal checkpoint interval (CI) to improve performance. Our previous work [11] focused on predicting recovery times and then optimizing the CI with regard to a single user-defined QoS constraint. However, this approach is aimed at scenarios where jobs process a static workload, i.e.…”
Section: B. Adaptive Checkpointing (mentioning)
confidence: 99%
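As a rough illustration of this idea (a minimal sketch, not the cited paper's actual algorithm), the snippet below picks the largest checkpoint interval whose predicted recovery time still satisfies a single user-defined QoS bound; the predictor `predict_recovery_time`, the candidate intervals, and the toy model are hypothetical placeholders.

```python
def choose_checkpoint_interval(candidate_intervals_s, qos_recovery_bound_s,
                               predict_recovery_time):
    """Pick the largest checkpoint interval (CI) whose predicted recovery
    time still satisfies the user-defined QoS bound.

    candidate_intervals_s: iterable of CI candidates in seconds.
    qos_recovery_bound_s:  maximum tolerable recovery time in seconds.
    predict_recovery_time: callable CI -> predicted recovery time in seconds
                           (e.g. a model fitted from profiling runs).
    """
    feasible = [ci for ci in sorted(candidate_intervals_s)
                if predict_recovery_time(ci) <= qos_recovery_bound_s]
    # A larger CI means less checkpointing overhead at runtime, so take the
    # largest interval that still meets the recovery-time constraint.
    return feasible[-1] if feasible else None


if __name__ == "__main__":
    # Toy linear predictor: recovery time grows with the CI because more
    # un-checkpointed state must be reprocessed after a failure.
    model = lambda ci: 5.0 + 0.4 * ci
    print(choose_checkpoint_interval(range(10, 310, 10), 60.0, model))  # -> 130
```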
“…Checkpointing mechanisms are among the most popular and effective techniques for achieving fault tolerance in real-world processing systems, and consequently a number of methods for auto-configuration of checkpointing have been proposed. Most of them try to optimize the checkpoint interval by means of predicting or utilizing failure rates [7], the Mean Time To Failure (MTTF) [8]-[10], or recovery times [11], whereas some approaches even employ advanced multi-level checkpointing [12]-[17]. Yet, the majority of such methods either assume static workloads, consider solely offline optimization, or are primarily designed for high-performance computing (HPC) environments, which renders them unsuitable for real-world DSP systems.…”
Section: Introduction (mentioning)
confidence: 99%
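For background only, a classical first-order rule of thumb for MTTF-based checkpoint-interval tuning is Young's approximation, shown below; it is given here as context and is not necessarily the formula used by any of the cited approaches.

```latex
% Young's first-order approximation for the optimal checkpoint interval:
%   T_opt : checkpoint interval, C : cost of writing one checkpoint,
%   M     : mean time to failure (MTTF)
T_{\mathrm{opt}} \approx \sqrt{2\,C\,M}
```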
“…Our overall approach borrows from Chiron [4]. Chiron uses a profiling-based approach to measure the capacity of DSP jobs with QoS requirements to find optimal checkpoint intervals.…”
Section: Related Work (mentioning)
confidence: 99%
“…In order to test the maximum capacity and increase the number of events processed by the DSP job, events are read from an earlier timestamp. All duplicate pipelines read from the same Kafka topic to increase accuracy [4]. Chiron builds on Timon [3], which tests alternate DSP configurations by deploying parallel pipelines that read from production data streams.…”
Section: Related Work (mentioning)
confidence: 99%
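A minimal sketch of this replay idea, assuming the kafka-python client and hypothetical broker/topic names (not Chiron's or Timon's actual implementation): each profiling pipeline seeks its consumer back to an earlier timestamp on the same topic, so the duplicate pipelines process identical, denser input than the production pipeline.

```python
import time
from kafka import KafkaConsumer, TopicPartition

# Hypothetical topic/broker names; the cited systems' actual setup may differ.
TOPIC = "events"
BROKERS = "localhost:9092"
REWIND_MS = 10 * 60 * 1000  # replay the last 10 minutes to raise the event rate

consumer = KafkaConsumer(bootstrap_servers=BROKERS, enable_auto_commit=False)
partitions = [TopicPartition(TOPIC, p)
              for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(partitions)

# Map the earlier timestamp to concrete offsets and seek every partition there,
# so this pipeline re-reads the same records the production pipeline already saw.
start_ts = int(time.time() * 1000) - REWIND_MS
offsets = consumer.offsets_for_times({tp: start_ts for tp in partitions})
for tp, ot in offsets.items():
    if ot is not None:
        consumer.seek(tp, ot.offset)

for record in consumer:
    pass  # feed records into the duplicate DSP pipeline under test
```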
“…Prominent applications within this domain include: the automated dynamic scaling of resources, which aims to reduce over- and under-provisioning, minimizing operating costs and preventing possible reductions in service quality [6]-[9]; the live migration of functionality/state across the network, where cluster metrics are collected, processed, and scanned for anomalies which might reveal performance degradations and/or signs of component failure [10]-[12]; and automatic system tuning, where dynamic runtime adjustments of system configurations are performed in order to improve overall system availability and reliability [13]-[15]. The majority of these approaches rely on coarse-grained metrics to reactively make remediation decisions.…”
Section: Introduction (mentioning)
confidence: 99%