Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery

Ashraf, Rizwan A.; Hukerikar, Saurabh; Engelmann, Christian

doi:10.1109/pdp2018.2018.00032

Cited by 17 publications

(13 citation statements)

References 19 publications


“…Although not yet part of the standard, the proposed ULFM extensions have already been applied in a number of studies (see, e.g. Ali et al, 2014; Ashraf et al, 2018; Bland et al, 2013; Cantwell and Nielsen, 2019; Engwer et al, 2018; Fagg and Dongarra, 2000; Gamell et al, 2017a, 2017b; Losada et al, 2020; Teranishi and Heroux, 2014).…”

Section: Resilience Methodologies (mentioning)

confidence: 99%

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

Benacchio, Bonaventura, Altenbernd et al. 2021

The International Journal of High Performance Computing Applica


Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.


“…Then, the MPIX_Comm_shrink routine is used to create a new communicator excluding the failed processes. An example of this kind of applications is found in [3], where an iterative application is rendered moldable by redistributing the rest of the dataset among the surviving processes.…”

Section: Typical Patterns (mentioning)

confidence: 99%

“…Ashraf et al [3] compare a shrinking and a non-shrinking solution for a fault-tolerant version of the generalized minimal residual (GMRES) algorithm. The algorithm already offers protection against silent data corruptions, and protection against hard errors is implemented combining ULFM and a backward global recovery using diskless checkpointing.…”

Section: Shrinking Solutions (mentioning)

confidence: 99%

Fault tolerance of MPI applications in exascale systems: The ULFM solution

Losada, González, Martín et al. 2020

Future Generation Computer Systems


The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the necessary flexibility to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which allows reducing the overhead and obtaining the required level of efficiency needed in the future exascale platforms.
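The shrinking pattern described in the citation statements above (exclude the failed processes, then redistribute the dataset among the survivors) can be illustrated with a small sketch. This is only the bookkeeping of the redistribution step, simulated in plain Python; it does not call MPI or ULFM, and the function names `block_partition` and `shrink_and_redistribute` are hypothetical, not taken from the cited papers.

```python
def block_partition(n_items, ranks):
    """Split indices 0..n_items-1 into contiguous blocks, one per rank.

    Assumption: a simple block distribution; the cited work may use
    a different layout.
    """
    base, extra = divmod(n_items, len(ranks))
    layout, start = {}, 0
    for i, rank in enumerate(ranks):
        # The first `extra` ranks take one additional item each.
        size = base + (1 if i < extra else 0)
        layout[rank] = range(start, start + size)
        start += size
    return layout


def shrink_and_redistribute(n_items, ranks, failed):
    """Drop failed ranks and recompute the layout over the survivors.

    This mimics what an application might do after obtaining a shrunken
    communicator; the real recovery would also move or restore the data.
    """
    survivors = [r for r in ranks if r not in failed]
    if not survivors:
        raise RuntimeError("no surviving processes")
    return block_partition(n_items, survivors)


before = block_partition(10, [0, 1, 2, 3])
after = shrink_and_redistribute(10, [0, 1, 2, 3], failed={2})
```

Here `before` assigns 10 items to four ranks, and `after` spreads the same 10 items over ranks 0, 1 and 3 once rank 2 is marked as failed. A substitute (non-shrinking) approach would instead keep the original layout and hand the failed rank's block to a replacement process.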


“…BB underutilization is also caused by the characteristics of the checkpoint/restart. HPC applications perform checkpoint with a fixed period [16]-[18], called checkpoint period, by repeating compute phase and I/O phase periodically. Unfortunately, as the checkpoint period ranges from tens of minutes to tens of hours, expensive BB resources keep idle for long compute phases.…”

Section: A Burst Buffer Underutilization (mentioning)

confidence: 99%

BBOS: Efficient HPC Storage Management via Burst Buffer Over-Subscription

Sung, Bang, Kim et al. 2020

2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)


To avoid access to PFS, dedicated BB allocation is preferred despite severe BB underutilization. Recently, new all-flash HPC storage systems with integrated BB and PFS are proposed, which speed up access to PFS. For this reason, we adopt BB over-subscription allocation method by allowing HPC applications to use BB only for I/O phase for improving BB utilization. Unfortunately, BB over-subscription aggravates I/O interference and demotion overhead from BB to PFS, resulting in degraded performance. To minimize the performance degradation, we develop an I/O scheduler to prevent I/O congestion and a new transparent data management system based on checkpoint/restart characteristics of HPC applications. With the proposed approach, not only the BB utilization can be improved, but also high performance of applications is achieved. In our experiments, we find that BB utilization is improved at least 2.2x, and more stable and higher checkpoint performance is guaranteed compared to other approaches. Besides, we achieve up to 96.4% hit ratio of restart requests on BB and up to 3.1x higher restart performance than others.
