Projections and measurements of error rates in near-exascale and exascale systems suggest a dramatic growth, due to extreme scale (10^9 cores), concurrency, software complexity, and deep submicron transistor scaling. Such growth makes resilience a critical concern, and may increase the incidence of errors that "escape", silently corrupting application state. Such errors can often be revealed by application software tests, but only with long latencies, and thus are known as latent errors. We explore how to efficiently recover from latent errors, with an approach called application-based focused recovery (ABFR). Specifically, we present a case study of stencil computations, a widely useful computational structure, showing how ABFR focuses recovery effort where needed, using intelligent testing and pruning to reduce recovery effort, and enables recovery effort to be overlapped with application computation. We analyze and characterize the ABFR approach on stencils, creating a performance model parameterized by error rate and detection interval (latency). We compare projections from the model to experimental results with the Chombo stencil application, validating the model and showing that ABFR on stencils can achieve significant reductions in error recovery cost (up to 400x) and recovery latency (up to 4x). Such reductions enable efficient execution at scale with high latent error rates.
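To make the idea of focusing recovery effort concrete, the following is a minimal sketch of why stencil structure allows pruning. It is not the ABFR implementation or the paper's performance model. It assumes a hypothetical 1D stencil of fixed radius, a single corrupted cell, and a clean checkpoint taken just before the corruption; the function names `affected_cells`, `focused_work`, and `full_rollback_work` are illustrative only. Because each stencil update reads only nearby cells, an error can spread by at most `radius` cells per step, so the region that may need recomputation grows with the detection latency rather than covering the whole domain.

```python
def affected_cells(error_cell, steps, radius, n_cells):
    """Indices a single corrupted cell could have influenced after
    `steps` updates of a 1D stencil (hypothetical model, clamped to the domain)."""
    reach = steps * radius
    lo = max(0, error_cell - reach)
    hi = min(n_cells - 1, error_cell + reach)
    return range(lo, hi + 1)


def focused_work(error_cell, latency_steps, radius, n_cells):
    """Cell updates recomputed if, at every step since the error,
    only the potentially corrupted cells are redone."""
    return sum(
        len(affected_cells(error_cell, t, radius, n_cells))
        for t in range(1, latency_steps + 1)
    )


def full_rollback_work(latency_steps, n_cells):
    """Cell updates recomputed if the whole domain is rolled back
    and every step since the error is replayed."""
    return latency_steps * n_cells


if __name__ == "__main__":
    n, cell, radius, latency = 1000, 500, 1, 10
    focused = focused_work(cell, latency, radius, n)
    full = full_rollback_work(latency, n)
    print(f"focused: {focused} updates, full rollback: {full} updates")
```

Under these assumptions the gap between the two counts widens as the domain grows and narrows as detection latency grows, which is consistent with the abstract's choice of error rate and detection interval as the model parameters. The specific ratios printed here are properties of this toy model and should not be read as the reported results.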
We consider the use of non-volatile memories in the form of burst buffers for resilience in supercomputers. Their cost and limited lifetime demand effective use and appropriate provisioning. We develop an analytic model for the behavior of workloads on systems with burst buffers, and use it to explore questions of cost-effective provisioning and mission-directed allocation of burst-buffer (SSD) lifetime. First, our results show that system efficiency can be increased by as much as 14% by considering a global perspective (workload mix, job size) for SSD lifetime allocation. Second, with size-based and system-efficiency-based lifetime allocation, large jobs suffer as much as 40% job efficiency loss; job-efficiency-based allocation must increase their allocations by 50% to eliminate this disparity. Finally, further results suggest that underprovisioning SSD lifetime (only 10-20% of the "optimum" as defined by per-job requirements without resource constraint) is sufficient to produce 90% system efficiency at failure rates three times that of current systems.