GPU Behavior on a Large HPC Cluster

DeBardeleben, Nathan; Blanchard, Sean; Monroe, Laura; Romero, Philip; Grunau, Daryl; Idler, Craig; Wright, Cornell

doi:10.1007/978-3-642-54420-0_66

Cited by 19 publications

(10 citation statements)

References 4 publications


“…Hard errors such as the ones we observed may never be detected in production environments without repeated runs of bitwise reproducible codes, and systems can pass standard acceptance testing while exhibiting considerable error on other calculations. The crashes and unreasonable energies observed in the worst of the failed runs may be written off as a program bug, and more concerningly, the subtle errors in final energy may be accepted as valid results of a code that produces a different answer each time because of rounding differences.…”

Section: Discussion (mentioning)

confidence: 85%

An investigation of the effects of hard and soft errors on graphics processing unit‐accelerated molecular dynamics simulations

Betz

DeBardeleben

Walker

2014

Concurrency and Computation


SUMMARY: Molecular dynamics (MD) simulations rely on the accurate evaluation and integration of Newton's equations of motion to propagate the positions of atoms in proteins during a simulation. As such, one can expect them to be sensitive to any form of numerical error that may occur during a simulation. Increasingly graphics processing units (GPUs) are being used to accelerate MD simulations. Current GPU architectures designed for high performance computing applications support error‐correcting codes (ECC) that detect and correct single bit‐flip soft error events in GPU memory; however, this error checking carries a penalty in terms of simulation speed. ECC is also a major distinguishing feature between high performance computing NVIDIA Tesla cards and the considerably more cost‐effective NVIDIA GeForce gaming cards. An argument often put forward for not using GeForce cards is that the results are unreliable because of the lack of ECC. In an initial attempt to quantify these concerns, an investigation of the reproducibility of GPU‐accelerated MD simulations using the AMBER software was conducted on the XSEDE supercomputer Keeneland, a cluster at Los Alamos National Laboratory, and a cluster at the San Diego Supercomputer Center. While the data collected are insufficient to make solid conclusions and more extensive testing is needed to provide quantitative statistics, the absence of ECC events and lack of any silent errors in all the simulations conducted to date suggest that these errors are exceedingly rare and as such the time and memory penalty of ECC may outweigh the utility of error checking functionality. However, a considerable amount of error originating from defective hardware was observed, which suggests that rigorous acceptance testing should be performed on new GPU‐based systems by repeatedly running reproducible yet realistic calculations.


“…Various studies have been conducted for understanding the reliability aspects of using GPU's in large-scale HPC systems. The studies suggest that the newer generation GPU's are more reliable, as are the large-scale HPC systems using them (i.e., the observed MTBF of systems using newer GPU's is much longer than their estimated MTBF) [2][3][4][5][6][7].…”

Section: GPUs for Exascale: DUEs and GPU Reliability (mentioning)

confidence: 99%

“…Unfortunately, GPUs have been shown to suffer from a high rate of Detected Unrecoverable Errors (DUEs) [2][3][4][5][6][7]. The mean time between failures (MTBF) is expected to become much worse as the number of compute nodes increases in the exascale generation.…”

Section: Introduction (mentioning)

confidence: 99%

CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

Garg

Mohan

Sullivan

et al. 2018

2018 IEEE International Conference on Cluster Computing (CLUSTER)


Unified Virtual Memory (UVM) was recently introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming the presence of virtual memory. Therefore, checkpointing of UVM will become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings are GPU-accelerated, with a current trend of ten additional GPU-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes. CRUM supports a fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. The runtime overhead of using CRUM is 6% on average, and the time for forked checkpointing is seen to be a factor of up to 40 times less than traditional, synchronous checkpointing.


“…Highly parallel computing architectures, like the Xeon Phi, have some reliability weaknesses [15,16,21,49]. For instance, a single particle generating a radiation-induced failure in the scheduler or shared memories (used to expedite parallel executions), is likely to affect the computation of several parallel threads.…”

Section: Background 2.1 Transient Errors Effects in HPC (mentioning)

confidence: 99%

Experimental and analytical study of Xeon Phi reliability

Oliveira

Pilla

DeBardeleben

et al. 2017

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Self Cite


We present an in-depth analysis of transient faults effects on HPC applications in Intel Xeon Phi processors based on radiation experiments and high-level fault injection. Besides measuring the realistic error rates of Xeon Phi, we quantify Silent Data Corruption (SDCs) by correlating the distribution of corrupted elements in the output to the application's characteristics. We evaluate the benefits of imprecise computing for reducing the programs' error rate. For example, for HotSpot a 0.5% tolerance in the output value reduces the error rate by 85%. We inject different fault models to analyze the sensitivity of given applications. We show that portions of applications can be graded by different criticalities. For example, faults occurring in the middle of LUD execution, or in the Sort and Tree portions of CLAMR, are more critical than the remaining portions. Mitigation techniques can then be relaxed or hardened based on the criticality of the particular portions.
