DOI: 10.1007/978-3-540-73940-1_15

Enhancing Fault-Tolerance of Large-Scale MPI Scientific Applications

Abstract: The running times of large-scale computational science and engineering parallel applications, executed on clusters or Grid platforms, are usually longer than the mean-time-between-failures (MTBF). Therefore, hardware failures must be tolerated to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery are very useful techniques to implement fault-tolerant applications. Although extensive research has been carried out in this field, there are few available…

Cited by 4 publications (1 citation statement)
References 12 publications

“…However, in general, MPI is rarely selected for developing real-time data processing systems because it does not provide standardized fault tolerance interfaces and semantics. Although extensive research [27,28,29] has been conducted in this area, few available tools exist to help parallel programmers enhance their applications with fault tolerance support. Moreover, the exploitation of MPI is impeded by difficulties in software development.…”
Section: Introduction
Citation type: mentioning
confidence: 99%