2008
DOI: 10.1137/040620394

Recovery Patterns for Iterative Methods in a Parallel Unstable Environment

Abstract: Several recovery techniques for parallel iterative methods are presented. First, the implementation of checkpoints in parallel iterative methods is described and analyzed. Then a simple checkpoint-free fault-tolerant scheme for parallel iterative methods, the lossy approach, is presented. When one processor fails and all its data is lost, the system is recovered by computing a new approximate solution using the data of the nonfailed processors. The iterative method is then restarted with this new vector. The m…
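
The recovery step described in the abstract can be illustrated with a small sketch. The abstract is truncated and does not give the formula, so the following is one plausible reading rather than the authors' implementation: for a linear system Ax = b, the lost block of x is rebuilt by solving the local block equation using the surviving entries. The names `solve_dense`, `lossy_recover`, and the 4x4 test matrix are hypothetical, chosen only for illustration.

```python
def solve_dense(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(M, rhs)]
    for c in range(n):
        # Pick the row with the largest pivot in column c.
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    # Back substitution on the upper-triangular system.
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y


def lossy_recover(A, b, x, lost):
    """Rebuild the entries `lost` of x from the surviving entries.

    Assumed recovery rule (not stated in the truncated abstract):
    solve A[lost, lost] * x_lost = b[lost] - A[lost, kept] * x[kept].
    """
    lost_set = set(lost)
    kept = [j for j in range(len(b)) if j not in lost_set]
    A_ll = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in kept) for i in lost]
    x_new = x[:]
    for i, v in zip(lost, solve_dense(A_ll, rhs)):
        x_new[i] = v
    return x_new


# Hypothetical example: a 4x4 diagonally dominant system split over two
# "processors"; the second one (entries 2 and 3) fails and loses its data.
A = [[4.0, -1.0, 0.0, 0.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [0.0, 0.0, -1.0, 4.0]]
x_true = [1.0, 2.0, 3.0, 4.0]
b = [sum(A[i][j] * x_true[j] for j in range(4)) for i in range(4)]

x_after_failure = [1.0, 2.0, 0.0, 0.0]  # lost entries zeroed out
x_restart = lossy_recover(A, b, x_after_failure, lost=[2, 3])
print(x_restart)  # vector the iterative method would be restarted from
```

In this toy case the surviving entries are exact, so the rebuilt entries match the true solution up to rounding. In an actual iteration the surviving entries are only approximate, so the recovered vector is an approximation as well, which is why the method is restarted from it rather than simply resumed.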

Cited by 45 publications (63 citation statements)
References 8 publications
“…However, if simultaneous errors impact a single relation we have two possible scenarios: 1) Simultaneous errors in a single vector are not a problem for our recovery strategy. This is trivial for vectors recovered from linear relations, and straightforward for submatrix relations [29]. For two failed blocks i and j, we can combine both block relations:…”
Section: Dealing With Multiple Errors (mentioning)
confidence: 99%
“…Also, very aggressive resilience strategies like process triplication are completely impractical unless we face very high fault rates [18]. Therefore, intermediate solutions that recompute an approximation of the lost data [29] or that save the process state in a checkpoint with a certain frequency have been extensively used [12], [32], [37]. However, most of these solutions involve backward recoveries, discarding useful computations, and thus incur significant slowdowns.…”
Section: Introduction (mentioning)
confidence: 99%
“…In a distributed environment the major cost of this method comes from obtaining the consistent snapshots and disk access to write the snapshots, which highlights the major drawback of such approaches, the relatively high overhead. Langou and Dongarra [36] investigated several checkpoint/recovery techniques and a checkpoint-free lossy fault tolerant technique for parallel iterative methods. Robert and Vivien [10,12] presented a unified model for several common checkpoint/restart protocols, extended in [16] to cover process replication.…”
Section: Related Work (mentioning)
confidence: 99%