On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications

Ropars, Thomas; Guermouche, Amina; Uçar, Bora; Meneses, Esteban; Kalé, Laxmikant V.; Cappello, Franck

doi:10.1007/978-3-642-23400-2_53

Cited by 29 publications

(40 citation statements)

References 16 publications

“…As a result of this lack of empirical study, a sizable body of current Cloud dependability mechanisms and workload characterization research is derived from analysis of other distributed systems [8][9][10] or incorporates theoretical values [11][12][13][14]. While such work is relevant to enhancing Cloud dependability mechanisms and workload characterization, it comes with significant limitations.…”

Section: Introduction (mentioning)

confidence: 99%

An Empirical Failure-Analysis of a Large-Scale Cloud Computing Environment

Garraghan

Townend

2014

2014 IEEE 15th International Symposium on High-Assurance Systems Engineering

Abstract: Cloud computing research is in great need of statistical parameters derived from the analysis of real-world systems. One aspect of this is the failure characteristics of Cloud environments composed of workloads and servers; currently, few metrics are available that quantify failure and repair times of workloads and servers at a large-scale. Workload metrics in particular are critical for characterizing and modeling accurate workload behavior, enabling more realistic workload simulation and failure scenarios of systems. This paper presents the analysis of failure data of a large-scale production Cloud environment (consisting of over 12,500 servers), and includes a study of failure and repair times and characteristics for both Cloud workloads and servers. Our results show that failure characteristics for workload and servers are highly variable and that production Cloud workloads can be accurately modeled by a Gamma distribution. Repair times range between 30 seconds to 4 days, and 25 minutes to 8 days, for workloads and servers respectively.

“…FT does not provide such good results because of the use of all-to-all communication primitives. Note that the results presented in [28] for the same applications run over 1024 processes show a better trade-off between clusters size and amount of data logged: less than 15% of processes to roll back with the same amount of logged data. Figure 5 compares MPICH2 native communications performance over Myrinet 10G, to the performance provided by HydEE for two processes in the same cluster (without logging), and for two processes belonging to different clusters (with logging), using Netpipe [30].…”

Section: A Prototype Description (mentioning)

confidence: 92%

“…To do so, we use the tool described in [28]. It tries to find a clustering configuration that provides a good trade-off between size of the clusters and amount of communications to log.…”

Section: A Prototype Description (mentioning)

confidence: 99%

HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications

Guermouche

Ropars

Snir

et al. 2012

2012 IEEE 26th International Parallel and Distributed Processing Symposium

Self Cite

Abstract: High performance computing will probably reach exascale in this decade. At this scale, mean time between failures is expected to be a few hours. Existing fault tolerant protocols for message passing applications will not be efficient anymore since they either require a global restart after a failure (checkpointing protocols) or result in huge memory occupation (message logging). Hybrid fault tolerant protocols overcome these limits by dividing applications processes into clusters and applying a different protocol within and between clusters. Combining coordinated checkpointing inside the clusters and message logging for the inter-cluster messages allows confining the consequences of a failure to a single cluster, while logging only a subset of the messages. However, in existing hybrid protocols, event logging is required for all application messages to ensure a correct execution after a failure. This can significantly impair failure free performance. In this paper, we propose HydEE, a hybrid rollback-recovery protocol for send-deterministic message passing applications, that provides failure containment without logging any event, and only a subset of the application messages. We prove that HydEE can handle multiple concurrent failures by relying on the send-deterministic execution model. Experimental evaluations of our implementation of HydEE in the MPICH2 library show that it introduces almost no overhead on failure free execution.

“…Ropars et al introduced a process clustering technique [47] that leverages the regular communication patterns of MPI collectives and uses a bisection-based graph partitioning algorithm to compute process clusters that facilitate partial message-logging protocols. For empirical evaluation, Ropars' clustering algorithm is coupled with the above-cited partial message-logging protocols and ran against set an HPC benchmarks including the NAS Parallel Benchmarks and LAMMPS.…”

Section: Fault-tolerance-centric Techniques (mentioning)

confidence: 99%

Record-and-Replay Techniques for HPC Systems: A Survey

Chapp

Sato

Ahn

et al. 2018

JSFI

Record-and-replay techniques provide the ability to record executions of nondeterministic applications and re-execute them identically. These techniques find use in the contexts of debugging, reproducibility, and fault-tolerance, especially in the presence of nondeterministic factors such as message races. Record-and-replay techniques are highly diverse in terms of the fidelity of replay they provide, the assumptions they make about the recorded application, the programming models they target, and the runtime overheads they impose. In the high performance computing (HPC) environment, all the above factors must be considered in concert, thus presenting additional implementation challenges. In this manuscript, we survey record-and-replay techniques in terms of the programming models they target and the workloads on which they were evaluated, providing a categorization of these techniques benefiting application developers and researchers targeting exascale challenges. This manuscript answers three questions through this survey: What are the gaps in the existing space of record-and-replay techniques? What is the roadmap to widespread use of record-and-replay on production-scale HPC workloads? And, what are the critical open problems that must be addressed to make record-and-replay viable at exascale?
