Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale

Daly, John T.; Pritchett, Lant; Michalak, Sarah

doi:10.1109/ccgrid.2008.103

Cited by 20 publications

(11 citation statements)

References 2 publications


“…While the latest memory mapping information is maintained and used by the restart operation, the pages saved by the preceding checkpoints but unmapped later are skipped. (4) We conduct experiments on a cluster that quantitatively show significant reductions in the size of checkpoint files and the overhead of checkpoint operations for our hybrid C/R. Hybrid checkpoints save 16 seconds wallclock time on average for all the cases by replacing three full checkpoints with incremental ones while overheads of restarts (if required) are an order of magnitude smaller for our experiments.…”

Section: This Work Was Supported In Part By Nsf Grants Ccr-0237570 (C

Citation type: mentioning

confidence: 94%


Hybrid Checkpointing for MPI Jobs in HPC Environments

Wang, Mueller, Engelmann, et al. 2010

2010 IEEE 16th International Conference on Parallel and Distributed Systems



“…Recent investigations [4] revealed that checkpoint/restart efficiency, i.e., the ratio of useful vs. scheduled machine time, can be as high as 85% and as low as 55% on current-generation HPC systems. However, only a subset of the process image changes between checkpoints.…”

Section: This Work Was Supported In Part By Nsf Grants Ccr-0237570 (C

Citation type: mentioning

confidence: 99%

Hybrid Checkpointing for MPI Jobs in HPC Environments

Wang, Mueller, Engelmann, et al. 2010

2010 IEEE 16th International Conference on Parallel and Distributed Systems


“…This evenly distributed simulated system MTTF applies to each application run separately, i.e., from start to finish/failure and from restart to finish/failure. In this worst case scenario, the application MTTF can differ significantly from the system MTTF [45]. Since the individual checkpoint files are extremely small and xSim's file system model is a work in progress, the file system overhead for checkpoint/restart was not considered in the experiments.…”

Section: Simulated System

Citation type: mentioning

confidence: 99%

Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems

Engelmann, Naughton 2013

2013 42nd International Conference on Parallel Processing


Abstract: xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.


“…We instead explore mean time to job interrupt and total application runtime [10] for arbitrary redundancy in the compute section. Daly has also developed a model for job interrupts (which he refers to as "application fatal errors") [9], but whereas he focusses on dependency hierarchies to model interrupt rates, we explore redundant computation as an interrupt reduction strategy.…”

Section: Related Work

Citation type: mentioning

confidence: 99%

“…Furthermore, we assume that nodes fail identically (all nodes have the same CDF), and independently (one node's failure does not affect another). These assumptions are common [10,52], but not universal [38,9] regarding HPC.…”

Section: Distribution-independent Formulation

Citation type: mentioning

confidence: 99%
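The identical-and-independent node assumption in the statement above has a simple consequence that is easy to check numerically: if every node has the same failure-time CDF F(t), the time to the first failure among N nodes has CDF 1 - (1 - F(t))^N, whatever F is. The sketch below is an illustration only and is not taken from the cited papers. The function names, the exponential node model, and the node MTTF and node count are all hypothetical.

```python
import math


def first_failure_cdf(node_cdf, n_nodes):
    """CDF of the time to first failure among n_nodes i.i.d. nodes.

    Holds for any node CDF F: P(first failure <= t) = 1 - (1 - F(t))**n_nodes.
    """
    def system_cdf(t):
        return 1.0 - (1.0 - node_cdf(t)) ** n_nodes
    return system_cdf


def exponential_cdf(mttf):
    """Exponential failure model, used here only as an example node CDF."""
    def cdf(t):
        return 1.0 - math.exp(-t / mttf) if t > 0 else 0.0
    return cdf


# Hypothetical numbers, not from the source.
node_mttf_hours = 50_000.0
n_nodes = 10_000

system_cdf = first_failure_cdf(exponential_cdf(node_mttf_hours), n_nodes)

# For exponential nodes the first-failure time is again exponential,
# with mean node_mttf_hours / n_nodes.
system_mttf_hours = node_mttf_hours / n_nodes

print(f"mean time to first failure: {system_mttf_hours} hours")
print(f"P(failure within that time): {system_cdf(system_mttf_hours):.3f}")
```

Under the exponential model, the mean time to first failure shrinks in proportion to the node count. This scaling is what makes interrupt rates a concern on large systems.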

A Model-Based Case for Redundant Computation

Stearley, Robinson, Ferreira, et al. 2011


Despite its seemingly nonsensical cost, we show through modeling and simulation that redundant computation merits full consideration as a resilience strategy for next-generation systems. Without revolutionary breakthroughs in failure rates, part counts, or stable-storage bandwidths, it has been shown that the utility of Exascale systems will be crushed by the overheads of traditional checkpoint/restart mechanisms. Alternate resilience strategies must be considered, and redundancy is a proven unrivaled approach in many domains. We develop a distribution-independent model for job interrupts on systems of arbitrary redundancy, adapt Daly's model for total application runtime, and find that his estimate for optimal checkpoint interval remains valid for redundant systems. We then identify conditions where redundancy is more cost effective than non-redundancy. These are done in the context of the number one supercomputers of the last decade, showing that thorough consideration of redundant computation is timely, if not overdue.

