2013
DOI: 10.1002/cpe.3016
A parallel scheme for accelerating parameter sweep applications on a GPU

Abstract: This paper proposes a parallel scheme for accelerating parameter sweep applications on a graphics processing unit (GPU). By using hundreds of GPU cores, the scheme simultaneously processes multiple parameters rather than a single parameter. The simultaneous sweeps exploit the similarity of computing behaviors shared by different parameters, allowing memory accesses to be coalesced into a single access if similar irregularities appear among the parameters’ computa…
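The idea sketched in the abstract can be illustrated with a minimal example. This is a hypothetical sketch, not the paper's implementation: the function name `sweep_simultaneous`, the linear update rule, and the step count are all placeholders. The point is that when every parameter advances in lockstep through the same code path, neighbouring "threads" touch neighbouring memory slots, which is exactly the access pattern a GPU can coalesce into a single transaction.

```python
# Hypothetical sketch of a simultaneous parameter sweep (not the paper's code).
# All parameters advance in lockstep; on a GPU, each index k below would be a
# separate thread, and neighbouring threads would read/write neighbouring
# elements of `state`, so their accesses could be coalesced.

def sweep_simultaneous(params, steps):
    state = list(params)              # one state slot per parameter
    for _ in range(steps):
        for k in range(len(state)):   # each k ~ one GPU thread, same code path
            state[k] = 0.5 * state[k] + 1.0  # placeholder update rule
    return state
```

Sweeping one parameter at a time would instead keep a single thread (or block) busy per parameter, losing the coalescing opportunity that the shared access pattern provides.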

Cited by 5 publications (3 citation statements)
References 25 publications
“…Moreover, the updating operations on different elements are mutually independent, hence stencil computation is an embarrassingly parallel scenario to leverage accelerators such as graphics processing units (GPUs). A GPU has thousands of cores and its memory bandwidth is 5−10× as high as that of a CPU, thus extensively utilized in accelerating compute- and memory-intensive applications [7]–[9]. Nonetheless, GPUs possess a relatively limited device-memory capacity, typically in the range of several dozen GBs.…”
Section: Introduction
Confidence: 99%
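The element-wise independence that makes stencil computation embarrassingly parallel can be shown with a minimal 1-D Jacobi-style sweep (a hypothetical sketch, not code from any cited paper): each output element is computed solely from the previous grid, so every element could be updated by its own thread with no synchronization within a sweep.

```python
def jacobi_step(grid):
    """One 3-point stencil sweep over a 1-D grid.

    Each output element depends only on the *previous* grid, never on other
    outputs, so all interior updates are mutually independent -- on a GPU,
    each index i could be handled by a separate thread.
    """
    n = len(grid)
    new = grid[:]                     # boundaries are kept fixed
    for i in range(1, n - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new
```

Writing into a separate output array (`new`) rather than updating in place is what guarantees the independence: no update reads a value another update has already overwritten.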
“…However, this process is time consuming because code and data structures usually must be adapted to the highly-threaded device architecture, which takes full advantage of memory latency hiding mechanisms. For example, arrays of structures must be transformed into structures of arrays to maximise memory access throughput on a GPU (Sung et al., 2012; Ino et al., 2014).…”
Section: Introduction
Confidence: 99%
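The array-of-structures to structure-of-arrays transformation mentioned above can be sketched in a few lines (a hypothetical illustration; the field names and the `aos_to_soa` helper are not from the cited papers). In the SoA layout, all values of one field are contiguous, so consecutive GPU threads reading that field would access consecutive addresses and the hardware could coalesce them.

```python
def aos_to_soa(records):
    """Transpose an array of structures into a structure of arrays.

    In the AoS input, thread k reading records[k]["x"] strides over the other
    fields; in the SoA output, one field's values are contiguous -- the layout
    a GPU can coalesce.
    """
    return {field: [r[field] for r in records] for field in records[0]}

# AoS: fields interleaved per record (illustrative data).
aos = [{"x": 1.0, "y": 2.0, "z": 3.0},
       {"x": 4.0, "y": 5.0, "z": 6.0}]

soa = aos_to_soa(aos)   # soa["x"] holds [1.0, 4.0] back to back
```

In CUDA C/C++ the same change is structural rather than a runtime conversion: `struct {float x, y, z;} p[N]` becomes three arrays `float x[N], y[N], z[N]`.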
“…High-performance clusters and grid systems are practical for performing parameter studies due to their large collection of processors and storage resources [8,19,36], although local computers can also be used thanks to advances in graphics processors and other accelerators [17]. The setup, submission, and orchestration of such jobs in computing clusters may be a challenge, particularly for non-programmers or novice users conducting parameter studies in a parallel or distributed fashion [10,25].…”
Section: Introduction
Confidence: 99%