24th IEEE Conference on Mass Storage Systems and Technologies (MSST 2007)
DOI: 10.1109/msst.2007.4367962

Modeling the Impact of Checkpoints on Next-Generation Systems

Cited by 86 publications (88 citation statements)
References 25 publications
“…Based on a study of twenty-two (anonymous) HPC systems, Gibson and Schroeder observe that failure rate grows in proportion to the number of sockets, and give an "optimistic" estimate of 0.1 failures per socket per year [18]. Dual-socket nodes are common in that study, corresponding to a mean time to node failure of 5 years (also used in [31]). ASCI White's MTBI has been reported to be 5 hours in 2001 and 40 hours in 2003 (after the platform had stabilized).…”
Section: Systems Survey (mentioning)
confidence: 99%
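
The arithmetic behind the 5-year figure in the statement above can be checked with a minimal sketch. It assumes that socket failures are independent and that a node fails whenever either socket fails, so per-socket rates simply add; the function name `node_mttf_years` is illustrative and does not come from the cited papers.

```python
def node_mttf_years(failures_per_socket_per_year: float, sockets_per_node: int) -> float:
    """Mean time to node failure in years, assuming per-socket failure rates add."""
    node_rate = failures_per_socket_per_year * sockets_per_node  # failures per node per year
    return 1.0 / node_rate


# Quoted "optimistic" estimate: 0.1 failures per socket per year, dual-socket nodes.
mttf = node_mttf_years(0.1, 2)
print(f"Node MTTF: {mttf:.1f} years")
```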
“…If node mean time to failure (1/λ) is 5-10 years [18,31], it will be more cost-effective to run redundant rather than non-redundant on a system size of 15-20 thousand nodes. If 1/λ is 20 years, crossover occurs at a system size of approximately 28k nodes.…”
Section: Conditions Where Dual-redundant Computation Is Cost Effective (mentioning)
confidence: 99%
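
To give a sense of scale for the node counts quoted above, the following rough sketch converts a per-node mean time to failure (1/λ) into a whole-system mean time between failures. It assumes independent, exponentially distributed node failures, so the system failure rate is the node count times λ. It does not reproduce the citing paper's cost model or its crossover calculation; `system_mtbf_hours` and `HOURS_PER_YEAR` are illustrative names.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def system_mtbf_hours(node_mttf_years: float, num_nodes: int) -> float:
    """System MTBF in hours when any single node failure interrupts the system."""
    return node_mttf_years * HOURS_PER_YEAR / num_nodes


# Node MTTF values and system sizes taken from the quoted statement.
for years, nodes in [(5, 15_000), (5, 20_000), (10, 20_000), (20, 28_000)]:
    hours = system_mtbf_hours(years, nodes)
    print(f"1/lambda = {years:>2} y, {nodes:>6} nodes -> system MTBF ~ {hours:.2f} h")
```

Under these assumptions a system in the quoted size range would see an interruption every few hours, which is the regime in which checkpoint and restart overhead becomes significant.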
“…This poses challenges for system software in the areas of scalability, resiliency and programmability. The sheer scale of future systems necessitates much improved resiliency, or else the majority of an application's runtime will be spent in overhead due to checkpoints and restarts [32]. Programming models are another area requiring attention.…”
Section: Introduction (mentioning)
confidence: 99%