Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2015
DOI: 10.1145/2807591.2807602
Improving concurrency and asynchrony in multithreaded MPI applications using software offloading

Cited by 30 publications (9 citation statements)
References 22 publications

Citation statements (ordered by relevance):
“…An alternative way of improving the overlap is using MPI+OpenACC+OpenMP, in which OpenMP is used to generate multiple threads. These threads can work on different tasks, such as computation and communication, so that the actual degree of overlap can be increased [30-34]. In fact, there is more literature discussing how to improve overlap performance, and almost all of it uses multiple threads.…”
Section: Results (mentioning)
confidence: 99%
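The overlap pattern described in the excerpt above can be illustrated with a minimal MPI+OpenMP sketch (OpenACC omitted for brevity). The buffer names, sizes, and ring-neighbour exchange are illustrative assumptions, not taken from the cited works: one thread drives the halo exchange while the remaining threads update interior points that do not depend on incoming data.

/* Hedged sketch of MPI+OpenMP communication/computation overlap.
 * Names and sizes are illustrative; intended for two or more OpenMP threads. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define N 1024

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* Only thread 0 of the team calls MPI, so FUNNELED is sufficient. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *halo_send = calloc(N, sizeof(double));
    double *halo_recv = calloc(N, sizeof(double));
    double *interior  = calloc((size_t)N * N, sizeof(double));
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        if (tid == 0) {
            /* Communication thread: nonblocking halo exchange with neighbours. */
            MPI_Request reqs[2];
            MPI_Irecv(halo_recv, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Isend(halo_send, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        } else {
            /* Worker threads: manually partitioned interior update that does
             * not depend on the incoming halo. */
            long total = (long)N * N;
            long chunk = (total + nth - 2) / (nth - 1);
            long lo = (long)(tid - 1) * chunk;
            long hi = lo + chunk < total ? lo + chunk : total;
            for (long i = lo; i < hi; ++i)
                interior[i] = 0.25 * interior[i];
        }
        #pragma omp barrier
        /* Past the barrier the halo has arrived; boundary points can be updated. */
    }

    free(halo_send); free(halo_recv); free(interior);
    MPI_Finalize();
    return 0;
}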
“…Vaidyanathan et al. [23] contributed an approach for asynchronous progress in the "MPI+X" model by utilizing a dedicated thread together with a lock-free command queue. The "MPI+X" model often utilizes multiple threads over multi- or many-core systems to parallelize computation and employs only a single MPI process per node for internode communication.…”
Section: Communication Asynchronous Progress (mentioning)
confidence: 99%
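The general shape of the technique referenced above, a dedicated communication thread fed through a lock-free command queue, might look as follows. This is a simplified sketch, not the implementation from Vaidyanathan et al. [23]: the queue is a single-producer/single-consumer ring, the cmd_t layout and QCAP capacity are invented for illustration, and real designs add backoff, completion notification, and support for many producers.

/* Hedged sketch: an application thread enqueues send commands into a lock-free
 * SPSC ring; a dedicated thread dequeues them and issues the MPI calls, so the
 * compute threads never enter the MPI library themselves. */
#include <mpi.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 256

typedef struct { void *buf; int count; int dest; int tag; } cmd_t;

static cmd_t queue[QCAP];
static atomic_size_t q_head, q_tail;          /* consumer / producer cursors */
static atomic_bool shutting_down;

static bool enqueue(cmd_t c)                  /* producer: an application thread */
{
    size_t t = atomic_load_explicit(&q_tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&q_head, memory_order_acquire);
    if (t - h == QCAP) return false;          /* queue full, caller may retry */
    queue[t % QCAP] = c;
    atomic_store_explicit(&q_tail, t + 1, memory_order_release);
    return true;
}

static void *comm_thread(void *arg)           /* consumer: dedicated MPI thread */
{
    (void)arg;
    for (;;) {
        size_t h = atomic_load_explicit(&q_head, memory_order_relaxed);
        size_t t = atomic_load_explicit(&q_tail, memory_order_acquire);
        if (h == t) {                         /* queue looks empty */
            if (!atomic_load(&shutting_down)) continue;   /* keep polling */
            /* shutdown observed: one more load drains any late command */
            t = atomic_load_explicit(&q_tail, memory_order_acquire);
            if (h == t) break;
        }
        cmd_t c = queue[h % QCAP];
        atomic_store_explicit(&q_head, h + 1, memory_order_release);
        MPI_Send(c.buf, c.count, MPI_BYTE, c.dest, c.tag, MPI_COMM_WORLD);
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank, size;
    /* Main and the comm thread never call MPI concurrently in this sketch,
     * so MPI_THREAD_SERIALIZED suffices here. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    pthread_t tid;
    pthread_create(&tid, NULL, comm_thread, NULL);

    static char payload[64] = "hello";
    if (size > 1 && rank == 0) {
        cmd_t c = { payload, sizeof payload, 1, 42 };
        while (!enqueue(c)) { /* spin until there is room */ }
    }
    if (size > 1 && rank == 1) {
        char in[64];
        MPI_Recv(in, sizeof in, MPI_BYTE, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    atomic_store(&shutting_down, true);
    pthread_join(tid, NULL);
    MPI_Finalize();
    return 0;
}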
“…Some MPI implementations also offer specific options for truly asynchronous progress (Intel, 2017; Pritchard et al., 2012b). However, these specific options do not enable performance portability, and such asynchronous progress can also require the maximum thread support level (MPI_THREAD_MULTIPLE (Pritchard et al., 2012a)), which can imply some performance overhead, as shown in Vaidyanathan et al. (2015).…”
Section: Deployment and Comparison On Multiple Nodes (mentioning)
confidence: 99%
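As a small illustration of the thread-level point made above, the sketch below requests MPI_THREAD_MULTIPLE at initialization and checks what the library actually provides. The fallback message is illustrative, and implementation-specific asynchronous-progress switches are deliberately not shown, since they differ between MPI libraries.

/* Hedged sketch: request the maximum thread support level that some
 * asynchronous-progress options demand, and fall back gracefully if the
 * MPI library provides less. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("MPI_THREAD_MULTIPLE unavailable (got level %d); "
                   "falling back to funneled communication\n", provided);
    }
    /* ... application ... */
    MPI_Finalize();
    return 0;
}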