Reaño González, C.; Peña Monferrer, A. J.; Silla Jiménez, F.; Duato Marín, J. F.; Mayo Gual, R.; Quintana Ortí, E. S. (2012)

Abstract-GPUs are being increasingly embraced by the high performance computing (HPC) and computational science communities as an effective way of considerably reducing execution time by accelerating significant parts of their application codes. However, despite their extraordinary computing capabilities, the adoption of GPUs in current HPC clusters may present certain negative side-effects. In particular, to ease job scheduling in these platforms, a GPU is usually attached to every node of the cluster. Besides increasing acquisition costs, this configuration favors GPUs frequently remaining idle, as applications usually do not fully utilize them. Idle GPUs, however, consume non-negligible amounts of energy, which translates into very poor energy efficiency during idle cycles.

rCUDA was recently developed as a software solution to address these concerns. Specifically, it is a middleware that allows a reduced number of GPUs to be transparently shared among the nodes of a cluster. rCUDA thus increases the GPU utilization rate while taking care of job scheduling. Although the initial prototype versions of rCUDA demonstrated its functionality, they also revealed several concerns related to usability and performance. With respect to usability, in this paper we present a new component of the rCUDA suite that automatically transforms any CUDA source code so that it can be effectively accommodated within this technology. With respect to performance, we briefly present some promising results, which will be analyzed in depth in future publications. The net outcome is a new version of rCUDA that allows any CUDA-compatible program to use remote GPUs in a cluster with minimal overhead.
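In essence, middleware of this kind replaces the CUDA runtime library on the client nodes with a wrapper that forwards API calls over the network to a server process on the node owning the physical GPU. The following C sketch illustrates that library-interception idea only; it is not rCUDA's actual implementation, and forward_to_server() and RPC_CUDA_MALLOC are hypothetical placeholders for the real communication layer.

    /* Sketch of the library-interception idea behind GPU-virtualization
     * middleware such as rCUDA. NOT rCUDA's actual code: the RPC layer
     * below is a hypothetical stand-in for the real network transport. */
    #include <stdio.h>
    #include <stddef.h>

    typedef int cudaError_t;            /* simplified; really an enum   */
    enum { RPC_CUDA_MALLOC = 1 };       /* hypothetical RPC op code     */

    struct malloc_args { void **devPtr; size_t size; };

    /* Stub: a real middleware would marshal the arguments, ship them to
     * the node that owns the physical GPU, and return the remote result. */
    static cudaError_t forward_to_server(int op, void *args, size_t size)
    {
        (void)args;
        printf("forwarding op %d (%zu bytes of arguments) to GPU server\n",
               op, size);
        return 0;                       /* pretend: cudaSuccess */
    }

    /* Exports the same name and signature as the CUDA runtime's
     * cudaMalloc, so an application only needs relinking, not rewriting. */
    cudaError_t cudaMalloc(void **devPtr, size_t size)
    {
        struct malloc_args args = { devPtr, size };
        return forward_to_server(RPC_CUDA_MALLOC, &args, sizeof args);
    }

    int main(void)                      /* tiny usage demonstration */
    {
        void *p = NULL;
        cudaMalloc(&p, 1024);
        return 0;
    }

Plain C API calls such as this one can be intercepted at link time; CUDA's C-language extensions (for example, the kernel-launch syntax), by contrast, are the kind of construct that the source-code transformation presented in this paper must rewrite first.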
I. INTRODUCTION

Due to the high computational cost of current compute-intensive applications, many scientists view graphics processing units (GPUs) as an efficient means of reducing the execution time of their applications. High-end GPUs include an extraordinarily large number of small computing units along with a high-bandwidth link to their private on-board memory. Therefore, it is no surprise that applications exhibiting a large ratio of arithmetic operations per data item can leverage the huge potential of these hardware accelerators.

In GPU-accelerated applications, high performance is usually attained by off-loading the computationally intensive parts of the application for execution on these devices. To achieve this, programmers have to specify which portions of their code will be executed on the CPU and which functions (or kernels) will be off-loaded to the GPU; a minimal example is sketched at the end of this section. Fortunately, there have been many efforts in recent years aimed at exploiting the massive parallelism of GPUs, leading to noticeable improvements in the programmability of these hybrid platforms.
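To make the split between CPU (host) and GPU (device) code concrete, the following minimal CUDA sketch off-loads a simple scaling computation to the GPU while the surrounding code runs on the CPU; the kernel name, sizes, and values are illustrative choices, not taken from the paper.

    /* Minimal illustration of the CUDA off-loading model: the kernel
     * executes on the GPU, everything else on the CPU. */
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale(float *x, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= alpha;              /* one GPU thread per element */
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_x = (float *)malloc(bytes);    /* CPU (host) buffer  */
        for (int i = 0; i < n; i++) h_x[i] = 1.0f;

        float *d_x;                             /* GPU (device) buffer */
        cudaMalloc((void **)&d_x, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

        /* Off-load the computation: launch the kernel on the GPU. */
        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);

        cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
        printf("x[0] = %f\n", h_x[0]);          /* prints 2.000000 */

        cudaFree(d_x);
        free(h_x);
        return 0;
    }

The <<<grid, block>>> launch syntax is one of the CUDA-specific language extensions mentioned above, which is why kernels must be flagged explicitly by the programmer.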