2020
DOI: 10.1177/1094342020901885

Overhead of using spare nodes

Abstract: With the increasing fault rate on high-end supercomputers, the topic of fault tolerance has been gathering attention. To cope with this situation, various fault-tolerance techniques are under investigation; these include user-level, algorithm-based fault-tolerance techniques and parallel execution environments that enable jobs to continue following node failure. Even with these techniques, some programs with static load balancing, such as stencil computation, may underperform after a failure recovery. Even whe…

Cited by 5 publications
(3 citation statements)
References 24 publications
“…[22]), but this is beyond the scope of this paper. More generally speaking, depending on how exactly CR is implemented, it could make use of spare nodes, enabling the application to keep using most of the nodes already allocated to it, but making it necessary to identify the identity of the lost node just as in the case of ESR; or the whole application could be restarted on newly-allocated nodes, although this is likely to be more costly than identifying the lost nodes, particularly at greater scales [6,12,26].…”
Section: Beyond Node-failure Simulation (mentioning)
confidence: 99%
“…The work mentioned so far supposes the availability of spare nodes. In [12], Hori et al propose strategies for the allocation of these spare nodes, and the replacement of lost nodes, when runtime performance is of consideration.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, it is often the case that some resources are not used while there are jobs in the queue, since the resource requirements of the waiting jobs are greater than the available resources. In addition, some jobs in progress do not use all their resources efficiently during execution, for example because they do not use all of their resources during the entire execution, or because they have spare nodes for fault tolerance [3].…”
Section: Introduction (mentioning)
confidence: 99%