2014
DOI: 10.1177/1094342014522573

Addressing failures in exascale computing

Abstract: We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach…

Cited by 282 publications (97 citation statements)
References 134 publications (160 reference statements)
“…In contrast, the Cray XK6/XK7 (Titan) at Oak Ridge National Laboratory (10-20/27 petaflops) achieves a MTBI of 132/173 h [2]. The anticipated failure rate of an exascale machine is likely to be higher than present systems [8,9,23,28] and therefore application resilience is critical in maintaining the usefulness of any future exascale system.…”
Section: Introduction (mentioning, confidence: 90%)
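The excerpt's MTBI figures make sense under the standard scaling argument: assuming independent, exponentially distributed node failures, the system-level MTBF shrinks linearly with node count, which is why exascale-class machines are expected to fail more often than Titan. A minimal sketch; the per-node MTBF and node count below are hypothetical, not taken from the excerpt:

```python
def system_mtbf_hours(node_mtbf_years: float, nodes: int) -> float:
    """Aggregate MTBF of `nodes` independent components, each with an
    exponentially distributed lifetime averaging `node_mtbf_years`."""
    return node_mtbf_years * 365 * 24 / nodes

# Hypothetical values: a 25-year per-node MTBF across 100,000 nodes
print(system_mtbf_hours(25.0, 100_000))  # ~2.19 hours between failures
```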
“…Algorithm and software resilience is now one of the greatest concerns in striving towards exascale and interruption, due to component failure, is now considered a major barrier to effectively using an exascale system with current numerical codes [9,28]. Both hardware and software errors, such as component failures or operating system crashes, may interrupt simulations or lead to non-deterministic results [27].…”
Section: Introduction (mentioning, confidence: 99%)
“…2. With probability 1/2 the error has struck in the other 2^(x−1) nodes and we don't need to recompute any of the first 2^(x−1) nodes. We can write…”
Section: ABFR (mentioning, confidence: 99%)
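The truncated excerpt reads like a halving argument over the error's location. One plausible completion, offered purely as an assumption on my part: if a single error is equally likely to sit in either half of the 2^x nodes, and only the half it struck must be recomputed while the other half is kept, the expected recomputation R satisfies

```latex
% Assumed model (not from the excerpt): error uniform over the two halves,
% full recomputation charged only to the half containing it.
\[
R\!\left(2^{x}\right)
  = \tfrac{1}{2}\,2^{x-1} + \tfrac{1}{2}\,R\!\left(2^{x-1}\right)
  = 2^{x-2} + \tfrac{1}{2}\,R\!\left(2^{x-1}\right)
  \;\Longrightarrow\;
  R\!\left(2^{x}\right) \approx \tfrac{2^{x}}{3}.
\]
```

Under this assumed model, only about a third of the 2^x nodes are recomputed in expectation rather than all of them; the recurrence actually written in the citing paper may differ.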
“…Future large-scale systems are projected to have higher error rates, with MTBFs (Mean Time Between Failures) as low as 20 minutes [1]. We focus on latent errors, which are not detected immediately after their occurrence.…”
Section: Introduction (mentioning, confidence: 99%)
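A 20-minute MTBF makes the checkpoint-frequency question concrete. As a point of reference (not part of the excerpt), Young's first-order approximation T_opt ≈ sqrt(2 · C · MTBF) relates the optimal checkpoint interval to the checkpoint cost C. A minimal sketch with an assumed 60 s checkpoint cost:

```python
from math import sqrt

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint
    interval: T_opt ~ sqrt(2 * C * MTBF)."""
    return sqrt(2.0 * checkpoint_cost_s * mtbf_s)

mtbf = 20 * 60   # the 20-minute MTBF from the excerpt, in seconds
cost = 60.0      # assumed (hypothetical) time to write one checkpoint
print(young_interval(cost, mtbf))  # ~379 s: checkpoint roughly every 6 min
```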
“…The issues are different for supercomputers, whose storage nodes typically comprise tens of thousands of individual disks interconnected through a dedicated high-speed storage network and managed by a parallel file system. Due to the scale of such infrastructures and the dramatic decrease of the Mean Time Between Failures (MTBF), many papers consider application checkpointing [1]. Accurately modeling and simulating the impact of reading and writing checkpointed data on disks is thus crucial for designing efficient policies.…”
Section: Introduction (mentioning, confidence: 99%)
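To illustrate the kind of model this excerpt is after, here is a deliberately simple sketch (all parameters hypothetical): checkpoint write time estimated from data volume and the aggregate bandwidth of the parallel file system. A real simulator would additionally model contention, striping imbalance, and metadata costs, which is precisely why the excerpt calls accurate modeling crucial.

```python
def checkpoint_write_time_s(data_gb: float, disks: int,
                            per_disk_mb_s: float) -> float:
    """Naive lower bound on checkpoint write time: total volume divided
    by aggregate disk bandwidth. Ignores network contention, striping
    imbalance, and file-system metadata overheads."""
    aggregate_mb_s = disks * per_disk_mb_s
    return data_gb * 1024 / aggregate_mb_s

# Hypothetical system: a 200 TB checkpoint over 20,000 disks at 150 MB/s each
print(checkpoint_write_time_s(200_000, 20_000, 150.0))  # ~68 s
```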