2015
DOI: 10.1109/tpds.2014.2316825

Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications

Abstract: Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data copy steps, and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and…
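The copy and synchronization boilerplate the abstract refers to is easiest to see in code. The following is a minimal sketch, not taken from the paper, of the conventional host-staged exchange between two GPUs using the stock CUDA runtime API; the function name, buffer names, and fixed device IDs are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Copy n floats from d_src (resident on GPU 0) to d_dst (resident on
 * GPU 1) by staging through pinned host memory -- the explicit,
 * multi-step pattern the abstract describes. Names and device IDs are
 * illustrative, not from the paper. */
void exchange_via_host(const float *d_src, float *d_dst, size_t n) {
    float *h_stage;
    cudaMallocHost((void **)&h_stage, n * sizeof(float)); /* pinned buffer */

    cudaSetDevice(0);                        /* make the source device current */
    cudaMemcpy(h_stage, d_src, n * sizeof(float),
               cudaMemcpyDeviceToHost);      /* blocks until the copy lands */

    cudaSetDevice(1);                        /* make the destination current */
    cudaMemcpy(d_dst, h_stage, n * sizeof(float),
               cudaMemcpyHostToDevice);      /* blocks again */

    cudaFreeHost(h_stage);
}
```

Each exchange crosses the PCIe bus twice and blocks the host in between; this per-device choreography is exactly what the runtime and architecture support proposed in the paper aims to hide.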

Cited by 7 publications (5 citation statements)
References 15 publications
“…Accelerator-based computing: Motivated by the lack of high-level abstractions in heterogeneous parallel programming models, which requires programmers to resort to complex data copying and synchronization schemes, the research community has come up with various proposals for easing programmability and improving performance. Examples include a runtime system and architecture support for simple and efficient data exchange [18] as well as an integrated message passing framework targeting end-to-end data movement among CUDA, OpenCL, and CPU memory spaces [19]. An overview of current heterogeneous systems and development frameworks [20] concludes that most works focus on outsourcing compute-intensive tasks entirely to accelerators, leaving the host CPU idle while the accelerators are busy.…”
Section: Related Work
confidence: 99%
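As a concrete contrast to the host-staged pattern shown earlier, the sketch below illustrates the direct peer-to-peer path that a data-exchange runtime such as [18] could select when the interconnect topology allows it. This is a generic CUDA illustration rather than code from the cited works; the integer error convention is an assumption.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Direct GPU-to-GPU copy over PCIe/NVLink when the topology allows it.
 * Returns -1 when peer access is unavailable so the caller can fall
 * back to a host-staged copy (the return convention is an assumption). */
int exchange_peer(float *d_dst, int dst_dev,
                  const float *d_src, int src_dev, size_t n) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);
    if (!can_access)
        return -1;                           /* no direct path available */

    cudaSetDevice(dst_dev);
    cudaDeviceEnablePeerAccess(src_dev, 0);  /* flags must be 0; returns an
                                                error if already enabled */

    /* One call: the driver routes the copy directly between devices. */
    cudaMemcpyPeer(d_dst, dst_dev, d_src, src_dev, n * sizeof(float));
    return 0;
}
```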
“…With respect to the problem of transparently managing a heterogeneous system, Tupinamba [21] proposes a framework for OpenCL that enables the transparent use of distributed GPUs. In the same vein, Cabezas et al. [22] present an interesting architecture-supported take on efficient, transparent data distribution among several GPUs. Nevertheless, these works overlook load balancing, which is essential when trying to make the most of several heterogeneous devices.…”
Section: Related Work
confidence: 99%
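To make the data-distribution problem concrete, here is a hedged sketch of the manual partitioning that approaches like the one by Cabezas et al. [22] aim to make transparent: one logical array split evenly across all visible GPUs, with a separate allocation, copy, and kernel launch per device. The scale kernel and the even chunking are illustrative assumptions, and the static split is precisely what breaks down without the load balancing this excerpt calls for.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

__global__ void scale(float *x, size_t n, float a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

/* Split h_data[0..n) evenly across all visible GPUs and scale each
 * chunk in place: one allocation, copy-in, launch, and copy-out per
 * device. Illustrative only; no overlap or load balancing. */
void scale_on_all_gpus(float *h_data, size_t n, float a) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    size_t chunk = (n + ndev - 1) / ndev;    /* ceiling division */

    for (int d = 0; d < ndev; ++d) {
        size_t off = (size_t)d * chunk;
        if (off >= n) break;
        size_t len = (off + chunk <= n) ? chunk : n - off;

        cudaSetDevice(d);
        float *d_x;
        cudaMalloc((void **)&d_x, len * sizeof(float));
        cudaMemcpy(d_x, h_data + off, len * sizeof(float),
                   cudaMemcpyHostToDevice);
        scale<<<(unsigned)((len + 255) / 256), 256>>>(d_x, len, a);
        cudaMemcpy(h_data + off, d_x, len * sizeof(float),
                   cudaMemcpyDeviceToHost); /* also syncs the kernel */
        cudaFree(d_x);
    }
}
```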
“…We have previously demonstrated the importance of correct use of NUMA topology when mapping host CPU threads to CPU sockets and GPUs for several molecular and cellular simulation applications [26], [28], [29]. Spafford et al. and Meredith et al. report findings for several other HPC applications [27], [30].…”
Section: NUMA and Multi-GPU Compute Nodes
confidence: 99%
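For illustration, a minimal Linux-only sketch of the thread-to-socket mapping discussed above, assuming a hypothetical two-socket, four-GPU node: the host thread is pinned to the cores of the socket presumed NUMA-local to its GPU before the CUDA device is selected. The gpu_to_first_core table and cores_per_socket constant are made-up assumptions; real code would query the topology (for example with hwloc or nvidia-smi topo -m) rather than hard-code it.

```cuda
#define _GNU_SOURCE
#include <sched.h>
#include <cuda_runtime.h>

/* Hypothetical topology: GPUs 0-1 hang off socket 0 (cores 0-7),
 * GPUs 2-3 off socket 1 (cores 8-15). Assumptions for illustration. */
static const int gpu_to_first_core[4] = { 0, 0, 8, 8 };
static const int cores_per_socket = 8;

/* Pin the calling host thread to the socket local to `gpu`, then bind
 * the CUDA context to that GPU so later allocations and launches stay
 * NUMA-local. */
void bind_thread_for_gpu(int gpu) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = gpu_to_first_core[gpu];
         c < gpu_to_first_core[gpu] + cores_per_socket; ++c)
        CPU_SET(c, &mask);
    sched_setaffinity(0, sizeof(mask), &mask); /* pid 0 = calling thread */
    cudaSetDevice(gpu);
}
```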