Transparent Accelerator Migration in a Virtualized GPU Environment

Xiao, Shiyu; Balaji, Pavan; Dinan, James; Zhu, Quing; Thakur, Rajeev; Coghlan, Susan; Lin, Heshan; Wen, Gongjian; Hong, Ju; Feng, Wu-chun

doi:10.1109/ccgrid.2012.26

Cited by 28 publications

(15 citation statements)

References 10 publications

“…However, the challenges are similar in this context, ie, ensuring a consistent state prior to the migration. This can, eg, be achieved by a virtualization of the accelerator resources and a migration of the accelerator images across physical devices at synchronization points. This way, even the migration across heterogeneous architectures is possible if frameworks such as OpenCL are used …”

Section: Results (mentioning)

confidence: 99%

Prospects and challenges of virtual machine migration in HPC

Pickartz

Clauss

Breitbart

et al. 2018

Concurrency and Computation

Summary: The continuous growth of supercomputers is accompanied by increased complexity of the intra‐node level and the interconnection topology. Consequently, the whole software stack ranging from the system software to the applications has to evolve, eg, by means of fault tolerance and support for the rising intra‐node parallelism. Migration techniques are one means to address these challenges. On the one hand, they facilitate the maintenance process by enabling the evacuation of individual nodes during runtime, ie, the implementation of fault avoidance. On the other hand, they enable dynamic load balancing for an improvement of the system's efficiency. However, these prospects come along with certain challenges. On the process level, migration mechanisms have to resolve so‐called residual dependencies to the source node, eg, the communication hardware. On the job level, migrations affect the communication topology, which should be addressed by the communication stack, ie, the optimal communication path between a pair of processes might change after a migration. In this article, we explore migration mechanisms for HPC and discuss their prospects as well as the challenges. Furthermore, we present solutions enabling their efficient usage in this domain. Finally, we evaluate our prototype co‐scheduler leveraging migration for workload optimization.

“…VOCL solves the local limitations of GPU devices by using modified GPGPU APIs to migrate GPGPU tasks to remote nodes so that nodes without GPUs can handle GPGPU tasks. Based on VOCL, the method in accelerator migration, supports the load balancing of GPU resources across GPU clusters by migrating GPGPU tasks in the GPU cluster environment. Floating devices is one of the GPU migration techniques based on rCUDA, an RPC‐based GPU sharing technology.…”

Section: Related Work (mentioning)

confidence: 99%

Partial migration technique for GPGPU tasks to Prevent GPU Memory Starvation in RPC‐based GPU Virtualization

Kang

Lim

2020

Softw Pract Exp

Graphics processing unit (GPU) virtualization technology enables a single GPU to be shared among multiple virtual machines (VMs), thereby allowing multiple VMs to perform GPU operations simultaneously with a single GPU. Because GPUs exhibit lower resource scalability than central processing units (CPUs), memory, and storage, many VMs encounter resource shortages while running GPU operations concurrently, implying that the VM performing the GPU operation must wait to use the GPU. In this paper, we propose a partial migration technique for general-purpose graphics processing unit (GPGPU) tasks to prevent the GPU resource shortage in a remote procedure call-based GPU virtualization environment. The proposed method allows a GPGPU task to be migrated to another physical server's GPU based on the available resources of the target's GPU device, thereby reducing the wait time of the VM to use the GPU. With this approach, we prevent resource shortages and minimize performance degradation for GPGPU operations running on multiple VMs. Our proposed method can prevent GPU memory shortage, improve GPGPU task performance by up to 14%, and improve GPU computational performance by up to 82%. In addition, experiments show that the migration of GPGPU tasks minimizes the impact on other VMs.
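
The abstract above says a task is migrated "based on the available resources of the target's GPU device" but does not spell out the selection policy. As a rough illustration only, the sketch below assumes a hypothetical policy: among servers whose GPU has enough free memory for the task, pick the one with the most headroom. The names `GpuServer` and `pick_migration_target` are invented for this example and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class GpuServer:
    """Hypothetical view of a remote server's GPU (not from the paper)."""
    name: str
    free_memory_mb: int


def pick_migration_target(
    required_mb: int, servers: Sequence[GpuServer]
) -> Optional[GpuServer]:
    """Return the server with the most free GPU memory that can hold the
    task, or None if no server has enough room (assumed policy)."""
    candidates = [s for s in servers if s.free_memory_mb >= required_mb]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.free_memory_mb)


if __name__ == "__main__":
    cluster = [
        GpuServer("node-a", 2000),
        GpuServer("node-b", 6000),
        GpuServer("node-c", 5000),
    ]
    print(pick_migration_target(4000, cluster))
    print(pick_migration_target(8000, cluster))
```

A real system would likely weigh compute load and transfer cost as well; free memory alone is used here because it is the only resource the abstract names explicitly.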

“…To checkpoint/restart a GPU application, the computation state is the key. Such a state is collected/constructed at a checkpointing event and restored at a later restarting event [24]. The state of a GPU application can be represented by variables declared in the program.…”

Section: Checkpoint/restart (mentioning)

confidence: 99%

A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States

Jiang

Zhang

Jennes

et al. 2013

IJNDC

Checkpoint/restart has been an effective mechanism to achieve fault tolerance for many long-running scientific applications. The common approach is to save computation states in memory and secondary storage for execution resumption. However, as the GPU plays a much bigger role in high performance computing, there is no effective checkpoint/restart scheme yet due to the difficulty of the GPU computation state handling. This paper proposes an application-level checkpoint/restart scheme to save and restore GPU computation states in annotated user programs. A pre-compiler and run-time support module are developed to construct and save states in CPU system memory dynamically, whereas secondary storage can be utilized for scalability and long-term fault tolerance. CUDA programs with complicated computation states are supported. State-related variables dissipated in various memory units are collected. Both stack and heap are duplicated at application level for state construction. Experimental results have demonstrated the effectiveness of the proposed scheme.
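
The general idea in this abstract (collect state-related variables, keep a copy in CPU memory, optionally spill to secondary storage, restore on restart) can be sketched in a few lines. This is a host-side illustration only, assuming the state is already available as a plain dictionary; it does not reproduce the paper's pre-compiler, annotations, or CUDA handling, and the `Checkpointer` class is a hypothetical name.

```python
import copy
import pickle
from pathlib import Path
from typing import Any, Dict, Optional


class Checkpointer:
    """Hypothetical application-level checkpoint store (illustration only)."""

    def __init__(self, disk_path: Optional[Path] = None) -> None:
        self._in_memory: Optional[Dict[str, Any]] = None
        self._disk_path = disk_path

    def checkpoint(self, state: Dict[str, Any]) -> None:
        # Duplicate the state so later mutation cannot corrupt the snapshot.
        self._in_memory = copy.deepcopy(state)
        # Optionally persist to secondary storage for longer-term recovery.
        if self._disk_path is not None:
            with open(self._disk_path, "wb") as f:
                pickle.dump(self._in_memory, f)

    def restart(self) -> Dict[str, Any]:
        # Prefer the in-memory copy; fall back to disk if one was written.
        if self._in_memory is not None:
            return copy.deepcopy(self._in_memory)
        if self._disk_path is not None and self._disk_path.exists():
            with open(self._disk_path, "rb") as f:
                return pickle.load(f)
        raise RuntimeError("no checkpoint available")


if __name__ == "__main__":
    state = {"step": 3, "data": [1, 2, 3]}
    cp = Checkpointer()
    cp.checkpoint(state)
    state["step"] = 99          # simulate further progress, then a failure
    state["data"].append(4)
    print(cp.restart())
```

The deep copy stands in for the duplication of stack and heap data that the abstract mentions; in a real GPU program the device-side state would first have to be copied back to the host.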
