ACM/IEEE SC 2006 Conference (SC'06) 2006
DOI: 10.1109/sc.2006.15

Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI

Abstract: A long-term trend in high-performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Fault tolerant programming environments should be used to guarantee the safe execution of critical applications. Research in fault tolerant MPI has led to the development of several fault tolerant MPI environments. Different approaches are being proposed using a variety of fault tolerant message passing protocols based on coordinated checkpointing or mes…

Citations: cited by 40 publications (24 citation statements)
References: 18 publications
“…In the case of application-level CR, a synchronization point is typically used right before checkpointing in order to guarantee that all messages have been consumed. A similar technique is also widely leveraged in practice for system-level checkpointing that uses a coordinated protocol [14], both for the blocking and non-blocking case. More recently, uncoordinated checkpointing protocols, which previously received little attention in practice due to the cost and complexity introduced by message logging [6], have been increasingly considered for certain classes of HPC applications [23].…”
Section: How To Capture the State Of The Application (mentioning)
confidence: 99%
“…This algorithm uses markers to coordinate the backup, and operates under the assumption of FIFO channels. In [3], blocking and non-blocking coordinated checkpointing protocols have been compared. Experiments have shown that the synchronization between nodes induced by the blocking protocol penalizes the performance of the computation more than a non-blocking protocol does.…”
Section: Non-blocking Coordinated Checkpointing (mentioning)
confidence: 99%
“…However, most research on checkpointing for MPI programs focuses on coordinated checkpointing. [12] compared blocking with non-blocking coordinated checkpointing for large-scale fault tolerant MPI programs. The authors found that for high-speed networks, the blocking implementation gives the best performance for sensible checkpoint frequencies.…”
Section: Related Work (mentioning)
confidence: 99%