2020 IEEE International Conference on Big Data (Big Data)
DOI: 10.1109/bigdata50022.2020.9378474
Chiron: Optimizing Fault Tolerance in QoS-aware Distributed Stream Processing Jobs

Abstract: Fault tolerance is a property which needs deeper consideration when dealing with streaming jobs requiring high levels of availability and low-latency processing even in case of failures where Quality-of-Service constraints must be adhered to. Typically, systems achieve fault tolerance and the ability to recover automatically from partial failures by implementing Checkpoint and Rollback Recovery. However, this is an expensive operation which impacts negatively on the overall performance of the system and manual…

Cited by 10 publications (7 citation statements). References 15 publications.
“…A number of approaches have been proposed that optimize the fault tolerance configuration parameters by finding an optimal checkpoint interval (CI) to improve performance. Our previous work [11] focused on predicting recovery times and then optimizing the CI with regard to a single user-defined QoS constraint. However, this approach is aimed at scenarios where jobs process a static workload, i.e.…”
Section: B. Adaptive Checkpointing (mentioning)
confidence: 99%
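As a rough illustration of this idea (a minimal sketch, not the cited paper's actual algorithm), the snippet below picks the largest checkpoint interval whose predicted recovery time still satisfies a single user-defined QoS bound; the predictor `predict_recovery_time`, the candidate intervals, and the toy model are hypothetical placeholders.

```python
def choose_checkpoint_interval(candidate_intervals_s, qos_recovery_bound_s,
                               predict_recovery_time):
    """Pick the largest checkpoint interval (CI) whose predicted recovery
    time still satisfies the user-defined QoS bound.

    candidate_intervals_s: iterable of CI candidates in seconds.
    qos_recovery_bound_s:  maximum tolerable recovery time in seconds.
    predict_recovery_time: callable CI -> predicted recovery time in seconds
                           (e.g. a model fitted from profiling runs).
    """
    feasible = [ci for ci in sorted(candidate_intervals_s)
                if predict_recovery_time(ci) <= qos_recovery_bound_s]
    # A larger CI means less checkpointing overhead at runtime, so take the
    # largest interval that still meets the recovery-time constraint.
    return feasible[-1] if feasible else None


if __name__ == "__main__":
    # Toy linear predictor: recovery time grows with the CI because more
    # un-checkpointed state must be reprocessed after a failure.
    model = lambda ci: 5.0 + 0.4 * ci
    print(choose_checkpoint_interval(range(10, 310, 10), 60.0, model))  # -> 130
```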
“…Checkpointing mechanisms are among the most popular and effective techniques for achieving fault tolerance in real-world processing systems, and consequently a number of methods for auto-configuration of checkpointing have been proposed. Most of them try to optimize the checkpoint interval by means of predicting or utilizing failure rates [7], the Mean Time To Failure (MTTF) [8]-[10], or recovery times [11], whereas some approaches even employ advanced multi-level checkpointing [12]-[17]. Yet, the majority of such methods either assume static workloads, consider solely offline optimization, or are primarily designed for high-performance computing (HPC) environments, which renders them unsuitable for real-world DSP systems.…”
Section: Introduction (mentioning)
confidence: 99%
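For background only, a classical first-order rule of thumb for MTTF-based checkpoint-interval tuning is Young's approximation, shown below; it is given here as context and is not necessarily the formula used by any of the cited approaches.

```latex
% Young's first-order approximation for the optimal checkpoint interval:
%   T_opt : checkpoint interval, C : cost of writing one checkpoint,
%   M     : mean time to failure (MTTF)
T_{\mathrm{opt}} \approx \sqrt{2\,C\,M}
```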
“…Our overall approach borrows from Chiron [4]. Chiron uses a profiling-based approach to measure the capacity of DSP jobs with QoS requirements to find optimal checkpoint intervals.…”
Section: Related Work (mentioning)
confidence: 99%
“…In order to test the maximum capacity and increase the number of events processed by the DSP job, events are read from an earlier timestamp. All duplicate pipelines read from the same Kafka topic to increase accuracy [4]. Chiron builds on Timon [3], which tests alternate DSP configurations by deploying parallel pipelines that read from production data streams.…”
Section: Related Work (mentioning)
confidence: 99%
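A minimal sketch of this replay idea, assuming the kafka-python client and hypothetical broker/topic names (not Chiron's or Timon's actual implementation): each profiling pipeline seeks its consumer back to an earlier timestamp on the same topic, so the duplicate pipelines process identical, denser input than the production pipeline.

```python
import time
from kafka import KafkaConsumer, TopicPartition

# Hypothetical topic/broker names; the cited systems' actual setup may differ.
TOPIC = "events"
BROKERS = "localhost:9092"
REWIND_MS = 10 * 60 * 1000  # replay the last 10 minutes to raise the event rate

consumer = KafkaConsumer(bootstrap_servers=BROKERS, enable_auto_commit=False)
partitions = [TopicPartition(TOPIC, p)
              for p in consumer.partitions_for_topic(TOPIC)]
consumer.assign(partitions)

# Map the earlier timestamp to concrete offsets and seek every partition there,
# so this pipeline re-reads the same records the production pipeline already saw.
start_ts = int(time.time() * 1000) - REWIND_MS
offsets = consumer.offsets_for_times({tp: start_ts for tp in partitions})
for tp, ot in offsets.items():
    if ot is not None:
        consumer.seek(tp, ot.offset)

for record in consumer:
    pass  # feed records into the duplicate DSP pipeline under test
```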
“…Prominent applications within this domain include: the automated dynamic scaling of resources, which aims to reduce over- and under-provisioning, minimizing operating costs and preventing possible reductions in service quality [6]-[9]; the live migration of functionality/state across the network, where cluster metrics are collected, processed, and scanned for anomalies which might reveal performance degradations and/or signs of component failure [10]-[12]; and automatic system tuning, where dynamic runtime adjustments of system configurations are performed in order to improve overall system availability and reliability [13]-[15]. The majority of these approaches rely on coarse-grained metrics to reactively make remediation decisions.…”
Section: Introduction (mentioning)
confidence: 99%