15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP'07) 2007
DOI: 10.1109/pdp.2007.44

Fault-tolerant solutions for a MPI compute intensive application

Cited by 4 publications
(4 citation statements)
References 14 publications
“…In [8], for computationally intensive applications using MPI, two approaches for checkpoint based fault tolerance is proposed. Firstly, segment-level solution, an extension of a checkpoint library for sequential codes.…”
Section: Background and Related Work
Citation type: mentioning
confidence: 99%
“…Mourino et al propose two approaches for checkpoint based fault tolerance in computationally intensive applications using MPI [7]. Firstly, segment-level solution, an extension of a checkpoint library for sequential codes.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…It is not desirable to have to restart a job from the beginning if it has been executing for hours or days or months [6]. A key challenge in maintaining the seamless (or near seamless) execution of such jobs in the event of failures is addressed under research in fault tolerance [7,8,9,10]. Many jobs rely on fault tolerant approaches that are implemented in the middleware supporting the job (for example [6,11,12,13]). The conventional fault tolerant mechanism supported by the middleware is checkpointing [14,15,16,17], which involves the periodic recording of intermediate states of execution of a job to which execution can be returned if a fault occurs.…”
Citation type: mentioning
confidence: 99%
“…Many jobs rely on fault tolerant approaches that are implemented in the middleware supporting the job (for example [6,11,12,13]). The conventional fault tolerant mechanism supported by the middleware is checkpointing [14,15,16,17], which involves the periodic recording of intermediate states of execution of a job to which execution can be returned if a fault occurs.…”
Citation type: mentioning
confidence: 99%