Fault-tolerant stream processing using a distributed, replicated file system

Kwon, YongChul; Balazinska, Magdalena; Greenberg, Albert

doi:10.14778/1453856.1453920

Cited by 61 publications

(48 citation statements)

References 35 publications


“…We assume that individual operator partitions are deterministic, i.e., an operator partition produces an identical output when it processes the same input tuples in the same order. This is a common assumption [20,44,36,2,35,23] and most relational operators are deterministic. In a distributed system, however, the order in which input tuples reach an operator partition may not be deterministic.…”

Section: Model and Assumptions (mentioning)

confidence: 99%

“…As an optimization, operators can checkpoint only delta-changes of their state [11]. Other optimizations are also possible [11,19,23] and can be used with our framework.…”

Section: Concrete Framework Instance (mentioning)

confidence: 99%
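
The delta-checkpoint optimization mentioned in the statement above can be illustrated with a minimal sketch. This is not the cited framework's implementation: the class `DeltaCheckpointer`, its methods, and the in-memory `log` that stands in for durable storage are all hypothetical names chosen for illustration. The idea shown is only that an operator records which keys changed since the last checkpoint and persists just those changes.

```python
import copy


class DeltaCheckpointer:
    """Hypothetical keyed operator state that checkpoints only changes."""

    def __init__(self):
        self.state = {}        # live operator state
        self._dirty = set()    # keys written since the last checkpoint
        self._deleted = set()  # keys removed since the last checkpoint
        self.log = []          # assumption: stands in for durable storage

    def put(self, key, value):
        self.state[key] = value
        self._dirty.add(key)
        self._deleted.discard(key)

    def delete(self, key):
        if key in self.state:
            del self.state[key]
            self._dirty.discard(key)
            self._deleted.add(key)

    def checkpoint(self):
        """Persist only the keys that changed since the previous checkpoint."""
        delta = {
            "upserts": {k: copy.deepcopy(self.state[k]) for k in self._dirty},
            "deletes": set(self._deleted),
        }
        self.log.append(delta)
        self._dirty.clear()
        self._deleted.clear()
        return delta


def restore(log):
    """Rebuild the state by replaying the deltas in checkpoint order."""
    state = {}
    for delta in log:
        state.update(delta["upserts"])
        for key in delta["deletes"]:
            state.pop(key, None)
    return state


op = DeltaCheckpointer()
op.put("a", 1)
op.put("b", 2)
op.checkpoint()          # first checkpoint carries both keys
op.put("a", 5)
op.delete("b")
second = op.checkpoint() # second checkpoint carries only the changes
recovered = restore(op.log)
```

In this sketch the second checkpoint holds one upsert and one delete rather than the full state, and replaying the log reproduces the live state. A real system would also compact the log periodically so that recovery time stays bounded.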

“…In Section 4.2, however, we showed, how to efficiently implement three well-known fault-tolerance strategies for generic stateless and stateful operators. Existing libraries can also help with such implementation (e.g., [23]). Developers must also (b) model their operator costs within a pipelined query plan.…”

Section: Approach Implementability (mentioning)

confidence: 99%


A latency and fault-tolerance optimizer for online parallel query plans

Upadhyaya

Kwon

Balazinska

2011

Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data

Self Cite


We address the problem of making online, parallel query plans fault-tolerant: i.e., provide intra-query fault-tolerance without blocking. We develop an approach that not only achieves this goal but does so through the use of different fault-tolerance techniques at different operators within a query plan. Enabling each operator to use a different fault-tolerance strategy leads to a space of fault-tolerance plans amenable to cost-based optimization. We develop FTOpt, a cost-based fault-tolerance optimizer that automatically selects the best strategy for each operator in a query plan in a manner that minimizes the expected processing time with failures for the entire query. We implement our approach in a prototype parallel query-processing engine. Our experiments demonstrate that (1) there is no single best fault-tolerance strategy for all query plans, (2) often hybrid strategies that mix-and-match recovery techniques outperform any uniform strategy, and (3) our optimizer correctly identifies winning fault-tolerance configurations.


“…For this, we are implementing concurrent copy-on-write data structures [14]. Further, due to other parallel, replicated data flows, any side-effect of operator migration is very likely to be hidden from the end-clients.…”

Section: Replication-aware Adaptation (mentioning)

confidence: 99%

“…These techniques either execute all the operator replicas [4,10,20] or consistently copy the state of a subset of replicas onto other replicas [10,11,14]. In contrast to these solutions, our iFlow conducts detouring as soon as it notices a transmission problem.…”

Section: Related Work (mentioning)

confidence: 99%

Detouring and replication for fast and reliable internet-scale stream processing

McConnell

Ping

Hwang

2010

Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing


iFlow is a replication-based system that can achieve both fast and reliable processing of high volume data streams on the Internet scale. iFlow uses a low degree of replication in conjunction with detouring techniques to overcome network congestion and outages. Computation over iFlow can be expressed as a graph of operators. To cope with varying system conditions these operators continually migrate in a manner that improves performance and availability at the same time. In this paper, we first provide an overview of our iFlow system. Next, we detail how our detouring technique works in the face of network failures to provide high availability for time critical applications. The paper also includes a description of our implementation and preliminary evaluation results demonstrating that iFlow outperforms previous solutions with less overhead. Finally, the paper concludes with our plans for enhancing replication and detouring capabilities.


Optimizing checkpoint‐based fault‐tolerance in distributed stream processing systems: Theory to practice

2021


Fault-tolerance is an essential part of a stream processing system that guarantees data analysis could continue even after failures. State-of-the-art distributed stream processing systems use checkpointing to support fault-tolerance for stateful computations where the state of the computations is periodically persisted. However, the frequency of performing checkpoints impacts the performance (utilization, latency, and throughput) of the system as the checkpointing process consumes resources and time that can be used for actual computations. In practice, systems are often configured to perform checkpoints based on crude values ignoring factors such as checkpoint and restart costs, leading to suboptimal performance. In our previous work, we proposed a theoretical optimal checkpoint interval that maximizes the system utilization for stream processing systems to minimize the impact of checkpointing on system performance. In this article, we investigate the practical benefits of our proposed theoretical optimal by conducting experiments in a real-world cloud setting using different streaming applications; we use Apache Flink, a well-known stream processing system for our experiments. The experiment results demonstrate that an optimal interval can achieve better utilization, confirming the practicality of the theoretical model when applied to real-world applications. We observed utilization improvements from 10% to 200% for a range of failure rates from 0.3 failures per hour to 0.075 failures per minute. Moreover, we explore how performance measures: latency and throughput are affected by the optimal interval. Our observations demonstrate that significant improvements can be achieved using the optimal interval for both latency and throughput.
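
The abstract above does not state the authors' formula for the optimal checkpoint interval, so the sketch below does not attempt to reproduce it. As a stand-in it uses Young's classical first-order approximation, T ≈ sqrt(2 · C · MTBF), where C is the time to write one checkpoint and MTBF is the mean time between failures. The function name and the example values are assumptions made for illustration only.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of a good checkpoint interval.

    Not the cited article's model; a classical stand-in that balances the
    time spent writing checkpoints against the work lost after a failure.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical values: a 2 s checkpoint and one failure per hour on average.
interval_s = young_interval(checkpoint_cost_s=2.0, mtbf_s=3600.0)
print(interval_s)  # 120.0
```

The approximation captures the trade-off the abstract describes: a longer interval lowers checkpoint overhead but raises the expected amount of recomputation after a failure, and the interval grows only with the square root of the mean time between failures.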
