Fault tolerance for large-scale applications has long been an area of active research, as the size of computations keeps growing. One key component of a fault-tolerance strategy is checkpointing. However, no explicit checkpoint-restart solution has been available for applications running over RDMA-based networks. RDMA-based networks are the primary networks used in high-performance computing, and many researchers believe that they will be widely deployed in the Cloud as costs decrease. Existing approaches often rely on a solution specific to a particular MPI implementation or other parallel model in order to disconnect the network at checkpoint time and reconnect it at restart time. Such schemes are difficult to incorporate into new parallel programming models, and they also imply higher checkpoint overhead.

In this dissertation, we present the first transparent, system-initiated checkpoint-restart solution that directly supports RDMA networks. This new approach does not depend on a specific parallel programming model and does not require any modification to the operating system. In addition, network connections remain active during checkpointing, making checkpointing more efficient.

Conceptually, this dissertation can be divided into three parts. First, we introduce a new, generic model for RDMA networks by extracting the key components for checkpointing an RDMA network. These components are the essential states that must be saved in order to restore the network connections on restart. The model is then applied to two distinct RDMA networks: InfiniBand and Intel Omni-Path. This work demonstrates the generality of the model, and it also describes the variations needed to adapt to InfiniBand or Omni-Path.

Second, we demonstrate the performance of the proposed approach.
Moving from a medium-sized academic computer cluster to a petascale supercomputer, we show which issues are exposed as the application scales up, and how these issues are addressed. In particular, different strategies for draining the network at checkpoint time are investigated, based on the underlying network protocol. As a result, the failure-free overhead is reduced to below 1%, even at the largest scale demonstrated: 32,752 processes.

Third, we show how to retrofit transparent checkpointing into the Cloud, as RDMA networks are also becoming more popular there. We present a Checkpointing-as-a-Service approach, which employs checkpointing to provide fault tolerance as a service in the Cloud and enables application migration in heterogeneous cloud environments.