2020
DOI: 10.1007/978-3-030-58144-2_19

Toward Supporting Multi-GPU Targets via Taskloop and User-Defined Schedules

Cited by 8 publications (5 citation statements)
References 18 publications
“…Torres et al. [19] propose extensions of OpenMP to distribute workload between multiple devices. Kale et al. [20] propose extensions to OpenMP task constructs to schedule loop computations across multiple GPUs.…”
Section: Related Work (mentioning, confidence: 99%)
“…Research works (Xu et al. [14], Komoda et al. [15], Yan et al. [16], [17], Cho et al. [18], Torres et al. [19], Kale et al. [20]) propose extensions to OpenMP and OpenACC to automate the complex process of distributing the computations and data of parallel loops between CPUs and accelerators. However, these works focus on the homogeneous distribution of loop iterations across multiple GPUs to achieve load balance.…”
Section: Introduction (mentioning, confidence: 99%)
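As a rough illustration of what such a homogeneous distribution of loop iterations across multiple GPUs can look like in plain OpenMP target offload, consider the following minimal sketch. The function and variable names are hypothetical and not taken from any of the cited works.

    /*
     * Illustrative sketch only: an equal-sized block of a parallel loop
     * is offloaded to each available GPU with standard OpenMP 4.5+
     * target offload.
     */
    #include <omp.h>

    void saxpy_blocked(long n, float alpha, float *x, float *y) {
        int ndev = omp_get_num_devices();
        if (ndev < 1) ndev = 1;               /* fall back to the host */
        long chunk = (n + ndev - 1) / ndev;   /* same-sized block per device */

        for (int d = 0; d < ndev; ++d) {
            long lo = (long)d * chunk;
            long hi = lo + chunk < n ? lo + chunk : n;
            if (lo >= hi) break;
            /* nowait turns each target region into a deferred task, so
               all devices are filled with work before the host waits */
            #pragma omp target teams distribute parallel for nowait \
                    device(d) map(to: x[lo:hi-lo]) map(tofrom: y[lo:hi-lo])
            for (long i = lo; i < hi; ++i)
                y[i] = alpha * x[i] + y[i];
        }
        #pragma omp taskwait                  /* join all offloaded blocks */
    }

Every device receives an equal-sized block, which balances load only when the devices perform identically — the limitation the statement above points out.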
“…But the main differences arise from the introduction of an abstraction such as Compute Unit, and also from the lack of support for a distributed shared memory approach. More recent works have evaluated the OpenMP 5.2 specification [27,28] and have indicated the lack of appropriate work-distribution schemes for hybrid executions, as well as the absence of support for resolving the entanglement between work distribution and data placement in a distributed shared-memory architecture.…”
Section: Related Work (mentioning, confidence: 99%)
“…Concerning OpenMP, several proposals have been made to address the use of multiple devices. One of them shows how OpenMP can be used to assign work to multiple GPUs on a node by collectively offloading tasks containing OpenMP target regions to the GPUs of a multi-GPU environment [16]. However, their implementation is built entirely from current language features, rather than being implemented directly in the compiler infrastructure.…”
Section: Related Work (mentioning, confidence: 99%)
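A minimal sketch of that idea, assembled from current OpenMP language features as the statement describes. The chunk count, the round-robin device choice, and all names here are illustrative assumptions, not the authors' exact scheme.

    /*
     * Illustrative sketch only: taskloop generates one task per chunk,
     * and each task offloads its chunk, as a target region, to one GPU
     * of a multi-GPU node — in the spirit of the approach in [16].
     */
    #include <omp.h>

    void scale_taskloop(long n, float alpha, float *a) {
        int ndev = omp_get_num_devices();
        if (ndev < 1) ndev = 1;               /* fall back to the host */
        long nchunks = 4L * ndev;             /* a few tasks per device */
        long csz = (n + nchunks - 1) / nchunks;

        #pragma omp parallel
        #pragma omp single
        #pragma omp taskloop
        for (long c = 0; c < nchunks; ++c) {
            long lo = c * csz;
            long hi = lo + csz < n ? lo + csz : n;
            if (lo >= hi) continue;
            int dev = (int)(c % ndev);        /* round-robin placement */
            #pragma omp target teams distribute parallel for \
                    device(dev) map(tofrom: a[lo:hi-lo])
            for (long i = lo; i < hi; ++i)
                a[i] *= alpha;
        }
    }

Because taskloop generates the tasks and several host threads execute them concurrently, multiple GPUs can be driven at once without compiler changes, which matches the citation's point that the approach works within the existing language features.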