Christian Engelmann scite author profile

Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming common place. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status.Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI application that is complementary to reactive FT using full checkpoint/restart schemes since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.

show abstract

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

Fiala¹,

Mueller²,

Engelmann³

et al. 2012

118

101

View full text Add to dashboard Cite

Abstract-Faults have become the norm rather than the exception for high-end computing on clusters with 10s/100s of thousands of cores. Exacerbating this situation, some of these faults remain undetected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper studies the potential for redundancy to both detect and correct soft errors in MPI message-passing applications. Our study investigates the challenges inherent to detecting soft errors within MPI application while providing transparent MPI redundancy. By assuming a model wherein corruption in application data manifests itself by producing differing MPI message data between replicas, we study the best suited protocols for detecting and correcting MPI data that is the result of corruption.To experimentally validate our proposed detection and correction protocols, we introduce RedMPI, an MPI library which resides in the MPI profiling layer. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications without requiring any modifications to the application source by utilizing either double or triple redundancy.Our results indicate that our most efficient consistency protocol can successfully protect applications experiencing even high rates of silent data corruption with runtime overheads between 0% and 30% as compared to unprotected applications without redundancy.Using our fault injector within RedMPI, we observe that even a single soft error can have profound effects on running applications, causing a cascading pattern of corruption in most cases causes that spreads to all other processes. RedMPI's protection has been shown to successfully mitigate the effects of soft errors while allowing applications to complete with correct results even in the face of errors.

show abstract

Addressing failures in exascale computing

Snir

Wisniewski

Abraham

et al. 2014

The International Journal of High Performance Computing Applica

282

View full text Add to dashboard Cite

We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software layers of an exascale system, and build on those results, examining potential solutions from both a hardware and software perspective and focusing on a combined approach.The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. The combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.

show abstract

Combining Partial Redundancy and Checkpointing for HPC

et al. 2012

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.