Adoption protocols for fanout-optimal fault-tolerant termination detection

Lifflander, Jonathan; Miller, Pam; Kalé, Laxmikant V.

doi:10.1145/2442516.2442519

Cited by 7 publications

(7 citation statements)

References 17 publications


“…We discuss some existing fault-tolerant termination detection algorithms, mainly from a functional point of view. Only [16] reports on performance results based on an actual implementation. Generally a complete network topology and a perfect failure detector are required, as such assumptions are essential for developing a fault-tolerant termination detection algorithm, see [20].…”

Section: Related Work (mentioning)

confidence: 99%


Fault-Tolerant Termination Detection with Safra’s Algorithm

2021



“…Lifflander et al [16] proposed a series of algorithms based on [9] that avoid the bottleneck of [15]. These algorithms are resistant to single-node failures but are only probabilistically tolerant to multi-node failures and incur additional control messages even in crash-free executions.…”

Section: Related Work (mentioning)

confidence: 99%

Fault-Tolerant Termination Detection with Safra’s Algorithm

2021


“…This termination detection model is very similar to Cilk's fully-strict spawn-sync model. Fault-tolerant extensions of the DS algorithm are presented in [6,7].…”

Section: Related Work (mentioning)

confidence: 99%

“…Lifflander et al [7] took a practical approach for resilient TD of a fully-strict diffusing computation. Based on the assumption that multi-node failures are rare in practice, and that the probability of a k-node failure decreases as k increases, they designed three variants of the DS protocol that can tolerate most but not all failures.…”

Section: Related Work (mentioning)

confidence: 99%

Resilient Optimistic Termination Detection for the Async-Finish Model

Hamouda

Milthorpe

2019

Lecture Notes in Computer Science


Abstract. Driven by increasing core count and decreasing mean-time-to-failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures. The async-finish task model, adapted for distributed systems as the asynchronous partitioned global address space programming model, provides a simple way to decompose a computation into nested task groups, each managed by a 'finish' that signals the termination of all tasks within the group. For distributed termination detection, maintaining a consistent view of task state across multiple unreliable processes requires additional book-keeping when creating or completing tasks and finish-scopes. Runtime systems which perform this book-keeping pessimistically, i.e. synchronously with task state changes, add a high communication overhead compared to non-resilient protocols. In this paper, we propose optimistic finish, the first message-optimal resilient termination detection protocol for the async-finish model. By avoiding the communication of certain task and finish events, this protocol allows uncertainty about the global structure of the computation which can be resolved correctly at failure time, thereby reducing the overhead for failure-free execution. Performance results using micro-benchmarks and the LULESH hydrodynamics proxy application show significant reductions in resilience overhead with optimistic finish compared to pessimistic finish. Our optimistic finish protocol is applicable to any task-based runtime system offering automatic termination detection for dynamic graphs of non-migratable tasks.

Recent advances in high-performance computing (HPC) systems have greatly increased parallelism, with both larger numbers of nodes, and larger core counts within each node. With increased system size and complexity comes an increase in the expected rate of failures. Programmers of HPC systems must therefore address the twin challenges of efficiently exploiting available parallelism and ensuring resilience to component failures. As more industrial and scientific communities rely on HPC to drive innovation, there is a need for productive programming models for scalable resilient applications.


“…Our work also seeks to reduce the amount of control data required, but it solves a more general problem not specific to a certain programming paradigm. Specific schemes for reducing control-related data have been studied extensively for a wide variety of algorithms [18]-[22].…”

Section: Introduction (mentioning)

confidence: 99%

Scalable replay with partial-order dependencies for message-logging fault tolerance

Lifflander

Meneses

Menon

et al.

2014

2014 IEEE International Conference on Cluster Computing (CLUSTER)

Self Cite


Abstract. Deterministic replay of a parallel application is commonly used for discovering bugs or to recover from a hard fault with message-logging fault tolerance. For message passing programs, a major source of overhead during forward execution is recording the order in which messages are sent and received. During replay, this ordering must be used to deterministically reproduce the execution. Previous work in replay algorithms often makes minimal assumptions about the programming model and application to maintain generality. However, in many applications, only a partial order must be recorded due to determinism intrinsic in the program, ordering constraints imposed by the execution model, and events that are commutative (their relative execution order during replay does not need to be reproduced exactly). In this paper, we present a novel algebraic framework for reasoning about the minimum dependencies required to represent the partial order for different orderings and interleavings. By exploiting this framework, we improve on an existing scalable message-logging fault tolerance scheme that uses a total order. The improved scheme scales to 131,072 cores on an IBM BlueGene/P with up to 2× lower overhead.

