Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2013
DOI: 10.1145/2442516.2442519

Adoption protocols for fanout-optimal fault-tolerant termination detection

Abstract: Termination detection is relevant for signaling completion (all processors are idle and no messages are in flight) of many operations in distributed systems, including work stealing algorithms, dynamic data exchange, and dynamically structured computations. In the face of growing supercomputers with increasing likelihood that each job may encounter faults, it is important for high-performance computing applications that rely on termination detection that such an algorithm be able to tolerate the inevitable fau…

Cited by 7 publications (7 citation statements)
References: 17 publications

“…We discuss some existing fault-tolerant termination detection algorithms, mainly from a functional point of view. Only [16] reports on performance results based on an actual implementation. Generally a complete network topology and a perfect failure detector are required, as such assumptions are essential for developing a fault-tolerant termination detection algorithm, see [20].…”
Section: Related Work (mentioning)
confidence: 99%
“…Lifflander et al [16] proposed a series of algorithms based on [9] that avoid the bottleneck of [15]. These algorithms are resistant to single-node failures but are only probabilistically tolerant to multi-node failures and incur additional control messages even in crash-free executions.…”
Section: Related Work (mentioning)
confidence: 99%
“…This termination detection model is very similar to Cilk's fully-strict spawn-sync model. Fault-tolerant extensions of the DS algorithm are presented in [6,7].…”
Section: Related Work (mentioning)
confidence: 99%
“…Lifflander et al [7] took a practical approach for resilient TD of a fully-strict diffusing computation. Based on the assumption that multi-node failures are rare in practice, and that the probability of a k-node failure decreases as k increases, they designed three variants of the DS protocol that can tolerate most but not all failures.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our work also seeks to reduce the amount of control data required, but it solves a more general problem not specific to a certain programming paradigm. Specific schemes for reducing control-related data have been studied extensively for a wide variety of algorithms [18]- [22].…”
Section: Introduction (mentioning)
confidence: 99%