Proceedings of the 2nd Workshop on Middleware for Grid Computing - 2004
DOI: 10.1145/1028493.1028499

Checkpointing-based rollback recovery for parallel applications on the InteGrade grid middleware

Abstract: InteGrade is a grid middleware infrastructure that enables the use of idle computing power from user workstations. One of its goals is to support the execution of long-running parallel applications that involve a considerable amount of communication among application nodes. However, in an environment composed of shared user workstations spread across many different LANs, machines may fail, become inaccessible, or switch from idle to busy very rapidly, compromising the execution of the parallel application …

Cited by 21 publications (14 citation statements); references 15 publications.
“…The adoption of a checkpoint-based approach introduces, independently of the occurrence of failures, an overhead to the normal application execution time. Integrade checkpointing implementation minimizes this overhead by copying the checkpoint data to a buffer and performing the coding and transfer of checkpoints through a separate application thread, allowing the application to concurrently continue its execution [3].…”
Section: Integrade Fault Tolerance Mechanism (mentioning)
confidence: 99%
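
The overlap described in the statement above (copy the checkpoint data to a buffer, then encode and transfer it in a separate thread) can be illustrated with a minimal Python sketch. This is not InteGrade's implementation: `AsyncCheckpointer`, `load_checkpoint`, the `ckpt-<version>.bin` file naming, and the use of `pickle` and a local directory in place of the real encoding and remote transfer are all assumptions made for illustration.

```python
import copy
import os
import pickle
import queue
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot in the caller's thread, write in a worker."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def checkpoint(self, version, state):
        # Copy the state to a private buffer so the application can keep
        # mutating its own data while the worker encodes and stores the copy.
        snapshot = copy.deepcopy(state)
        self._pending.put((version, snapshot))

    def _drain(self):
        # Worker loop: encode each buffered snapshot and store it.
        while True:
            item = self._pending.get()
            if item is None:  # sentinel queued by close()
                break
            version, snapshot = item
            path = os.path.join(self.directory, f"ckpt-{version}.bin")
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(snapshot, f)
            # Rename only after a complete write, so a crash mid-write
            # never leaves a truncated checkpoint under the final name.
            os.replace(tmp_path, path)

    def close(self):
        # Flush every queued checkpoint, then stop the worker.
        self._pending.put(None)
        self._worker.join()


def load_checkpoint(directory, version):
    """Read back the snapshot stored for the given version."""
    with open(os.path.join(directory, f"ckpt-{version}.bin"), "rb") as f:
        return pickle.load(f)
```

In this sketch the only cost paid on the application's critical path is the in-memory copy; serialization and disk I/O happen concurrently with the computation, which is the trade-off the quoted statement attributes to the buffered approach.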
“…Checkpoint data recovery also involves a query to the CDRM, requesting the list of ADRs where the application checkpoints were stored. More details about Integrade checkpointing mechanism can be found in [2], [3]. Figure 2 illustrates our implementation of the Integrade protocol for executing application replicas.…”
Section: Integrade Fault Tolerance Mechanism (mentioning)
confidence: 99%
“…There is now a rich literature on checkpointing techniques for parallel computation on a cluster [1,7,25,26,28,34,35]. Nevertheless, a thorny issue remains.…”
Section: Introduction (mentioning)
confidence: 99%
“…They provide a C API that allows applications written in both C and C++ to use them. The basic checkpointing functionality is provided by functions to manipulate the checkpoint stack, to save the stack data to a file, and to recover checkpointing data [28].…”
Section: The Integrade Grid Middleware (mentioning)
confidence: 99%
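
The three operations named in the last statement (manipulating a checkpoint stack, saving the stack data to a file, and recovering it) might look roughly like the following Python sketch. The libraries the statement describes expose a C API whose function names are not given here, so `CheckpointStack`, `push`, `pop`, `save`, and `recover` are hypothetical names, and JSON is an assumed file format chosen only to keep the example self-contained.

```python
import json


class CheckpointStack:
    """Hypothetical stack of (name, value) entries that make up a checkpoint."""

    def __init__(self):
        self._entries = []

    def push(self, name, value):
        # Register a piece of application state to include in the checkpoint.
        self._entries.append((name, value))

    def pop(self):
        # Drop the most recently registered entry, e.g. when leaving a scope.
        return self._entries.pop()

    def items(self):
        # Return a copy so callers cannot modify the stack by accident.
        return list(self._entries)

    def save(self, path):
        # Write the whole stack to a file (assumed format: JSON).
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self._entries, f)

    @classmethod
    def recover(cls, path):
        # Rebuild a stack from a previously saved file.
        stack = cls()
        with open(path, encoding="utf-8") as f:
            stack._entries = [(name, value) for name, value in json.load(f)]
        return stack
```

A restarted process would call `CheckpointStack.recover(path)` and read the entries back in the order they were pushed; the values must be JSON-serializable in this sketch.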