2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID) 2008
DOI: 10.1109/ccgrid.2008.103

Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale

Cited by 20 publications (11 citation statements)
References 2 publications
“…While the latest memory mapping information is maintained and used by the restart operation, the pages saved by the preceding checkpoints but unmapped later are skipped. (4) We conduct experiments on a cluster that quantitatively show significant reductions in the size of checkpoint files and the overhead of checkpoint operations for our hybrid C/R. Hybrid checkpoints save 16 seconds wallclock time on average for all the cases by replacing three full checkpoints with incremental ones while overheads of restarts (if required) are an order of magnitude smaller for our experiments.…”
Section: This Work Was Supported In Part By Nsf Grants Ccr-0237570 (C
Citation type: mentioning
confidence: 94%
“…Recent investigations [4] revealed that checkpoint/restart efficiency, i.e., the ratio of useful vs. scheduled machine time, can be as high as 85% and as low as 55% on current-generation HPC systems. However, only a subset of the process image changes between checkpoints.…”
Section: This Work Was Supported In Part By Nsf Grants Ccr-0237570 (C
Citation type: mentioning
confidence: 99%
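
Note: the statement above defines checkpoint/restart efficiency as the ratio of useful to scheduled machine time. The following minimal Python sketch illustrates that ratio only. The function names and the sample hours are hypothetical and are not taken from the cited paper.

```python
def checkpoint_efficiency(useful_hours: float, scheduled_hours: float) -> float:
    """Ratio of useful machine time to scheduled machine time."""
    if scheduled_hours <= 0:
        raise ValueError("scheduled_hours must be positive")
    if not 0 <= useful_hours <= scheduled_hours:
        raise ValueError("useful_hours must lie between 0 and scheduled_hours")
    return useful_hours / scheduled_hours


def lost_hours(scheduled_hours: float, efficiency: float) -> float:
    """Machine time spent on checkpoints, restarts, and rework at a given efficiency."""
    if not 0 <= efficiency <= 1:
        raise ValueError("efficiency must lie in [0, 1]")
    return scheduled_hours * (1.0 - efficiency)


if __name__ == "__main__":
    # Hypothetical 1000-hour allocation at the two efficiencies quoted above.
    for eff in (0.85, 0.55):
        print(f"efficiency {eff:.0%}: about {lost_hours(1000.0, eff):.0f} hours not spent on useful work")
```

At the quoted extremes, the same allocation loses roughly three times as much machine time at 55% efficiency as at 85%.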
“…This evenly distributed simulated system MTTF applies to each application run separately, i.e., from start to finish/failure and from restart to finish/failure. In this worst case scenario, the application MTTF can differ significantly from the system MTTF [45]. Since the individual checkpoint files are extremely small and xSim's file system model is a work in progress, the file system overhead for checkpoint/restart was not considered in the experiments.…”
Section: Simulated System
Citation type: mentioning
confidence: 99%
“…We instead explore mean time to job interrupt and total application runtime [10] for arbitrary redundancy in the compute section. Daly has also developed a model for job interrupts (which he refers to as "application fatal errors") [9], but whereas he focusses on dependency hierarchies to model interrupt rates, we explore redundant computation as an interrupt reduction strategy.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…Furthermore, we assume that nodes fail identically (all nodes have the same CDF), and independently (one node's failure does not affect another). These assumptions are common [10,52], but not universal [38,9] regarding HPC.…”
Section: Distribution-independent Formulation
Citation type: mentioning
confidence: 99%
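
Note: the last statement assumes nodes fail identically and independently. Under that assumption, plus the further assumption (made here only for illustration) that the job is interrupted as soon as any one node fails, the time to first failure can be sketched as below. The function names are hypothetical, and the exponential case is shown only as a familiar special case, not as the model used by any of the cited works.

```python
import math


def first_failure_cdf(node_cdf_value: float, n_nodes: int) -> float:
    """P(at least one of n i.i.d. nodes has failed by time t), given F(t) for one node.

    Assumes no redundancy: the first node failure interrupts the job.
    """
    if not 0 <= node_cdf_value <= 1:
        raise ValueError("node_cdf_value must lie in [0, 1]")
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return 1.0 - (1.0 - node_cdf_value) ** n_nodes


def exponential_cdf(t: float, mttf: float) -> float:
    """CDF of an exponential failure time with the given mean time to failure."""
    return 1.0 - math.exp(-t / mttf)


def system_mttf_exponential(node_mttf: float, n_nodes: int) -> float:
    """Mean time to first failure for n i.i.d. exponential nodes: node MTTF / n."""
    return node_mttf / n_nodes


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 nodes, each with a 5-year MTTF (in hours).
    node_mttf_hours = 5 * 365 * 24
    print(system_mttf_exponential(node_mttf_hours, 10_000))  # hours to first failure
```

The first function is distribution-independent: it needs only the single-node CDF value at time t, which is why the i.i.d. assumption is convenient for formulations of this kind.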