2015
DOI: 10.1109/tpds.2014.2316825

Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications

Abstract: Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data copy steps, and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and…
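The copy and synchronization boilerplate the abstract refers to is easiest to see in code. The following is a minimal sketch, not taken from the paper, of the conventional host-staged exchange between two GPUs using the stock CUDA runtime API; the function name, buffer names, and fixed device IDs are illustrative assumptions.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Copy n floats from d_src (resident on GPU 0) to d_dst (resident on
 * GPU 1) by staging through pinned host memory -- the explicit,
 * multi-step pattern the abstract describes. Names and device IDs are
 * illustrative, not from the paper. */
void exchange_via_host(const float *d_src, float *d_dst, size_t n) {
    float *h_stage;
    cudaMallocHost((void **)&h_stage, n * sizeof(float)); /* pinned buffer */

    cudaSetDevice(0);                        /* make the source device current */
    cudaMemcpy(h_stage, d_src, n * sizeof(float),
               cudaMemcpyDeviceToHost);      /* blocks until the copy lands */

    cudaSetDevice(1);                        /* make the destination current */
    cudaMemcpy(d_dst, h_stage, n * sizeof(float),
               cudaMemcpyHostToDevice);      /* blocks again */

    cudaFreeHost(h_stage);
}
```

Each exchange crosses the PCIe bus twice and blocks the host in between; this per-device choreography is exactly what the runtime and architecture support proposed in the paper aims to hide.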

Cited by 7 publications (5 citation statements)
References 15 publications
“…Accelerator-based computing: Motivated by the lack of high-level abstractions in heterogeneous parallel programming models, which requires programmers to resort to complex data copying and synchronization schemes, the research community has come up with various proposals for easing programmability and improving performance. Examples include a runtime system and architecture support for simple and efficient data exchange [18] as well as an integrated message passing framework targeting end-to-end data movement among CUDA, OpenCL, and CPU memory spaces [19]. An overview of current heterogeneous systems and development frameworks [20] concludes that most works focus on outsourcing compute-intensive tasks entirely to accelerators, leaving the host CPU idle while the accelerators are busy.…”
Section: Related Work
confidence: 99%
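As a concrete contrast to the host-staged pattern shown earlier, the sketch below illustrates the direct peer-to-peer path that a data-exchange runtime such as [18] could select when the interconnect topology allows it. This is a generic CUDA illustration rather than code from the cited works; the integer error convention is an assumption.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

/* Direct GPU-to-GPU copy over PCIe/NVLink when the topology allows it.
 * Returns -1 when peer access is unavailable so the caller can fall
 * back to a host-staged copy (the return convention is an assumption). */
int exchange_peer(float *d_dst, int dst_dev,
                  const float *d_src, int src_dev, size_t n) {
    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);
    if (!can_access)
        return -1;                           /* no direct path available */

    cudaSetDevice(dst_dev);
    cudaDeviceEnablePeerAccess(src_dev, 0);  /* flags must be 0; returns an
                                                error if already enabled */

    /* One call: the driver routes the copy directly between devices. */
    cudaMemcpyPeer(d_dst, dst_dev, d_src, src_dev, n * sizeof(float));
    return 0;
}
```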
“…With respect to the problem of transparently managing a heterogeneous system, Tupinamba [21] proposes a framework for OpenCL that enables the transparent use of distributed GPUs. In the same vein, Cabezas et al. [22] present an interesting architecture-supported take on efficient, transparent data distribution among several GPUs. Nevertheless, these works overlook load balancing, which is essential when trying to make the most of several heterogeneous devices.…”
Section: Related Work
confidence: 99%
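To make the data-distribution problem concrete, here is a hedged sketch of the manual partitioning that approaches like the one by Cabezas et al. [22] aim to make transparent: one logical array split evenly across all visible GPUs, with a separate allocation, copy, and kernel launch per device. The scale kernel and the even chunking are illustrative assumptions, and the static split is precisely what breaks down without the load balancing this excerpt calls for.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

__global__ void scale(float *x, size_t n, float a) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

/* Split h_data[0..n) evenly across all visible GPUs and scale each
 * chunk in place: one allocation, copy-in, launch, and copy-out per
 * device. Illustrative only; no overlap or load balancing. */
void scale_on_all_gpus(float *h_data, size_t n, float a) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    size_t chunk = (n + ndev - 1) / ndev;    /* ceiling division */

    for (int d = 0; d < ndev; ++d) {
        size_t off = (size_t)d * chunk;
        if (off >= n) break;
        size_t len = (off + chunk <= n) ? chunk : n - off;

        cudaSetDevice(d);
        float *d_x;
        cudaMalloc((void **)&d_x, len * sizeof(float));
        cudaMemcpy(d_x, h_data + off, len * sizeof(float),
                   cudaMemcpyHostToDevice);
        scale<<<(unsigned)((len + 255) / 256), 256>>>(d_x, len, a);
        cudaMemcpy(h_data + off, d_x, len * sizeof(float),
                   cudaMemcpyDeviceToHost); /* also syncs the kernel */
        cudaFree(d_x);
    }
}
```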
“…We have previously demonstrated the importance of correct use of NUMA topology when mapping host CPU threads to CPU sockets and GPUs for several molecular and cellular simulation applications [26], [28], [29]. Spafford et al. and Meredith et al. report findings for several other HPC applications [27], [30].…”
Section: NUMA and Multi-GPU Compute Nodes
confidence: 99%
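For illustration, a minimal Linux-only sketch of the thread-to-socket mapping discussed above, assuming a hypothetical two-socket, four-GPU node: the host thread is pinned to the cores of the socket presumed NUMA-local to its GPU before the CUDA device is selected. The gpu_to_first_core table and cores_per_socket constant are made-up assumptions; real code would query the topology (for example with hwloc or nvidia-smi topo -m) rather than hard-code it.

```cuda
#define _GNU_SOURCE
#include <sched.h>
#include <cuda_runtime.h>

/* Hypothetical topology: GPUs 0-1 hang off socket 0 (cores 0-7),
 * GPUs 2-3 off socket 1 (cores 8-15). Assumptions for illustration. */
static const int gpu_to_first_core[4] = { 0, 0, 8, 8 };
static const int cores_per_socket = 8;

/* Pin the calling host thread to the socket local to `gpu`, then bind
 * the CUDA context to that GPU so later allocations and launches stay
 * NUMA-local. */
void bind_thread_for_gpu(int gpu) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = gpu_to_first_core[gpu];
         c < gpu_to_first_core[gpu] + cores_per_socket; ++c)
        CPU_SET(c, &mask);
    sched_setaffinity(0, sizeof(mask), &mask); /* pid 0 = calling thread */
    cudaSetDevice(gpu);
}
```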