2010 10th IEEE International Conference on Computer and Information Technology 2010
DOI: 10.1109/cit.2010.226

MMPI: A Scalable Fault Tolerance Mechanism for MPI Large Scale Parallel Computing

Abstract: At present, Checkpoint/Restart is one of the most popular fault tolerance mechanisms for large scale parallel computing. However, the time to save a global checkpoint reaches and even exceeds the mean-time-between-failures (MTBF) of the component when the performance of the system is between Peta (10^15) and Exa (10^18) flops, which limits the scalability of the parallel computing. In this paper, a scalable fault tolerance mechanism is designed for MPI-oriented large scale parallel computing, which not only ca…

citations
Cited by 4 publications
(1 citation statement)
references
References 4 publications
“…The recently proposed MMPI [18] offers a set of protocols for redundant execution of MPI applications with different replica partitioning and comparison schemes. It relies entirely on cumbersome source code modifications for implementing the redundancy protocols.…”
Section: Redundant Execution With MMPI (citation type: mentioning)
confidence: 99%