2011
DOI: 10.1007/978-3-642-23400-2_53

On the Use of Cluster-Based Partial Message Logging to Improve Fault Tolerance for MPI HPC Applications

Abstract: Fault tolerance is becoming a major concern in HPC systems. The two traditional approaches for message passing applications, coordinated checkpointing and message logging, have severe scalability issues. Coordinated checkpointing protocols make all processes roll back after a failure. Message logging protocols log a huge amount of data and can induce an overhead on communication performance. Hierarchical rollback-recovery protocols based on the combination of coordinated checkpointing and message logging are a…

Cited by 29 publications (40 citation statements)
References 16 publications
“…As a result of this lack of empirical study, a sizable body of current Cloud dependability mechanisms and workload characterization research is derived from analysis of other distributed systems [8][9][10] or incorporates theoretical values [11][12][13][14]. While such work is relevant to enhancing Cloud dependability mechanisms and workload characterization, it comes with significant limitations.…”
Section: Introduction (mentioning)
confidence: 99%
“…FT does not provide such good results because of the use of all-to-all communication primitives. Note that the results presented in [28] for the same applications run over 1024 processes show a better trade-off between clusters size and amount of data logged: less than 15% of processes to roll back with the same amount of logged data. Figure 5 compares MPICH2 native communications performance over Myrinet 10G, to the performance provided by HydEE for two processes in the same cluster (without logging), and for two processes belonging to different clusters (with logging), using Netpipe [30].…”
Section: A Prototype Description (mentioning)
confidence: 92%
“…To do so, we use the tool described in [28]. It tries to find a clustering configuration that provides a good trade-off between size of the clusters and amount of communications to log.…”
Section: A Prototype Description (mentioning)
confidence: 99%
“…Ropars et al introduced a process clustering technique [47] that leverages the regular communication patterns of MPI collectives and uses a bisection-based graph partitioning algorithm to compute process clusters that facilitate partial message-logging protocols. For empirical evaluation, Ropars' clustering algorithm is coupled with the above-cited partial message-logging protocols and ran against set an HPC benchmarks including the NAS Parallel Benchmarks and LAMMPS.…”
Section: Fault-tolerance-centric Techniques (mentioning)
confidence: 99%
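
The citation statements above describe a trade-off between cluster size and the amount of communication to log. The sketch below illustrates how such a trade-off could be scored for a given clustering. It is not the tool or the partitioning algorithm from the cited work. It assumes that only messages crossing cluster boundaries are logged, that a failure forces exactly the failed process's cluster to roll back, and that failures hit processes uniformly at random. The function name `clustering_costs` and the input layout are hypothetical.

```python
from collections import Counter


def clustering_costs(comm_bytes, cluster_of):
    """Score a clustering (illustrative metric, not the cited tool).

    comm_bytes: dict mapping (sender_rank, receiver_rank) -> bytes sent.
    cluster_of: list where cluster_of[rank] is that rank's cluster id.
    Returns (logged_fraction, expected_rollback_fraction).
    """
    total = sum(comm_bytes.values())
    # Assumption: only inter-cluster messages are logged.
    logged = sum(
        nbytes
        for (src, dst), nbytes in comm_bytes.items()
        if cluster_of[src] != cluster_of[dst]
    )
    logged_fraction = logged / total if total else 0.0

    n = len(cluster_of)
    sizes = Counter(cluster_of)
    # Assumption: a uniformly random process fails and its whole cluster
    # rolls back, so a cluster of size s is hit with probability s/n and
    # then rolls back a fraction s/n of all processes.
    expected_rollback = sum((s / n) ** 2 for s in sizes.values())
    return logged_fraction, expected_rollback


# Hypothetical 4-process ring: each rank sends 100 bytes to its successor.
ring = {(0, 1): 100, (1, 2): 100, (2, 3): 100, (3, 0): 100}

one_cluster = clustering_costs(ring, [0, 0, 0, 0])    # pure coordinated checkpointing
two_clusters = clustering_costs(ring, [0, 0, 1, 1])   # hybrid
per_process = clustering_costs(ring, [0, 1, 2, 3])    # pure message logging
```

Under these assumptions the two extremes match the abstract's description: a single cluster logs nothing but rolls back every process, while one cluster per process logs every message. A clustering tool would search between those extremes for a configuration that keeps both numbers low.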