Design and Implementation for Checkpointing of Distributed Resources Using Process-Level Virtualization

Arya, Kapil; Garg, Rohan; Polyakov, A. Yu.; Cooperman, Gene

doi:10.1109/cluster.2016.55

Cited by 19 publications

(23 citation statements)

References 17 publications


“…The plugin libraries are injected along with the checkpoint library. They serve to translate real ids into virtual ids seen by the application, and to update the virtual address translation table with the new real ids that are seen on restart [21]. This virtualization capability is used to virtualize below the level of the MPI library (Section 4.4.2).…”

Section: DMTCP (mentioning)

confidence: 99%

System-Level Scalable Checkpoint-Restart for Petascale Computing

Cao, Arya, Garg, et al. 2016

2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS)

Self Cite


Fault tolerance for the upcoming exascale generation has long been an area of active research. One of the components of a fault tolerance strategy is checkpointing. Petascale-level checkpointing is demonstrated through a new mechanism for virtualization of the InfiniBand UD (unreliable datagram) mode, and for updating the remote address on each UD-based send, due to lack of a fixed peer. Note that InfiniBand UD is required to support modern MPI implementations. An extrapolation from the current results to future SSD-based storage systems provides evidence that the current approach will remain practical in the exascale generation. This transparent checkpointing approach is evaluated using a framework of the DMTCP checkpointing package. Results are shown for HPCG (linear algebra), NAMD (molecular dynamics), and the NAS NPB benchmarks. In tests up to 32,752 MPI processes on 32,752 CPU cores, checkpointing of a computation with a 38 TB memory footprint in 11 minutes is demonstrated. Runtime overhead is reduced to less than 1%. The approach is also evaluated across three widely used MPI implementations.
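The citation statement above describes plugins that translate real ids into virtual ids and refresh a translation table with new real ids on restart. A minimal sketch of that idea follows. The class name `VirtualIdTable` and its methods are hypothetical, chosen only for illustration, and are not DMTCP's plugin API.

```python
class VirtualIdTable:
    """Hypothetical two-way table mapping stable virtual ids to real ids."""

    def __init__(self):
        self._virt_to_real = {}
        self._real_to_virt = {}
        self._next_virtual = 1

    def register(self, real_id):
        """Return the virtual id for real_id, allocating one if needed."""
        if real_id in self._real_to_virt:
            return self._real_to_virt[real_id]
        virt = self._next_virtual
        self._next_virtual += 1
        self._virt_to_real[virt] = real_id
        self._real_to_virt[real_id] = virt
        return virt

    def to_real(self, virtual_id):
        """Translate the id the application holds into the current real id."""
        return self._virt_to_real[virtual_id]

    def to_virtual(self, real_id):
        """Translate a real id from the system into the application's view."""
        return self._real_to_virt[real_id]

    def rebind(self, virtual_id, new_real_id):
        """After restart, point an existing virtual id at a new real id."""
        old_real = self._virt_to_real[virtual_id]
        del self._real_to_virt[old_real]
        self._virt_to_real[virtual_id] = new_real_id
        self._real_to_virt[new_real_id] = virtual_id


# Usage: the application keeps the same virtual id across a restart.
table = VirtualIdTable()
vid = table.register(4242)   # real id seen before the checkpoint
table.rebind(vid, 9001)      # real id assigned after restart
```

The application only ever stores `vid`, so the change of real id on restart stays invisible to it, which is the point of virtualizing the identifier.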


“…Experiments use DMTCP [27] version 3.0. We developed a CRUM-specific DMTCP plugin [28] for checkpoint-restart of NVIDIA CUDA UVM applications. The DMTCP CRUM plugin (referred to as the CRUM plugin from here onwards) interposes on the CUDA calls made by the application.…”

Section: Software (mentioning)

confidence: 99%

CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

Garg, Mohan, Sullivan, et al. 2018

2018 IEEE International Conference on Cluster Computing (CLUSTER)

Self Cite


Unified Virtual Memory (UVM) was recently introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that UNIX applications long ago switched to assuming the presence of virtual memory. Therefore, checkpointing of UVM will become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings are GPU-accelerated, with a current trend of ten additional GPU-based supercomputers each year.

A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple computer nodes. CRUM supports a fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. The runtime overhead of using CRUM is 6% on average, and the time for forked checkpointing is seen to be a factor of up to 40 times less than traditional, synchronous checkpointing.
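The abstract mentions forked checkpointing, where writing the checkpoint image overlaps with continued computation. The general technique can be sketched with a plain POSIX `fork`: the child holds a copy-on-write snapshot and writes it out while the parent keeps running. This is an illustration of the generic idea only, not CRUM's implementation. The function name `forked_checkpoint` is hypothetical, and the sketch assumes a Unix-like system and a picklable state.

```python
import os
import pickle


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    The parent returns immediately and may keep mutating `state`; the
    child sees the state as it was at the moment of the fork.
    """
    pid = os.fork()
    if pid == 0:
        # Child process: write the snapshot, then exit without running
        # the parent's cleanup handlers.
        status = 1
        try:
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            # Atomic rename so a reader never sees a partial image.
            os.replace(tmp_path, path)
            status = 0
        finally:
            os._exit(status)
    return pid
```

A caller would invoke `forked_checkpoint(state, path)`, continue computing, and later call `os.waitpid(pid, 0)` to confirm the image was written. A synchronous checkpoint would instead block the computation for the whole duration of the write.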


“…DMTCP provides a transparent checkpointing mechanism that provides for checkpoint/restart without any modification of the original application code or operating system. DMTCP also provides a plugin facility to adapt the transparent checkpointing capability of the target application to external subsystems, such as the handling of a network connection [3].…”

Section: A Checkpointing Mechanism (mentioning)

confidence: 99%

“…It may use a communication protocol, such as Bluetooth or Wi-Fi, in order to send or receive information from a gateway. This concept is aligned with the definition of Sensor from the SSN ontology 3 .…”

Section: A System Representation (mentioning)

confidence: 99%

Intelligent Checkpointing Strategies for IoT System Management

Aïssaoui, Cooperman, Monteil, et al. 2017

2017 IEEE 5th International Conference on Future Internet of Things and Cloud (FiCloud)

Self Cite


The Internet of Things (IoT) continues to expand in terms of the number of connected devices. To handle the data produced by those devices, gateways are deployed to collect data, possibly to analyze it, and finally to send it to the cloud or to the end-user to support new services. This process involves complex software that is deployed on those gateways. Moreover, the dynamicity due to new services, mobility, etc., could be corrupted by new events that then require the deployment of software components on additional equipment. Those new events arise in at least two fundamental ways: devices that may change their geographical location; and limitations due to hardware resources and energy consumption. We propose to use autonomic monitoring and control in response to a changing environment in order to manage deployed software with little or no human intervention. A new generic approach is described, based on a semantic model of the system being monitored. Much of the power of the proposed approach is accomplished through a novel use of checkpointing in order to control the software deployed on the gateway.
