2020
DOI: 10.1007/s11227-020-03257-3
Efficiency and productivity for decision making on low-power heterogeneous CPU+GPU SoCs

Cited by 12 publications (4 citation statements) · References 16 publications
“…For both CUDA and HIP, the first two operations may be fulfilled by the {cuda,hip}MemGetInfo and {cuda,hip}Malloc routines, the latter of which produces C-style pointers to device memory segments. As of the SYCL 2020 standard (as adopted by DPC++) [14], similar allocation semantics may be achieved using unified shared memory (USM) [62,63] via the routines cl::sycl::device::get_info and cl::sycl::malloc_device, respectively. Given the C-pointers to device memory segments from any of the above programming models, pool allocations for the preallocated memory may be implemented in a high-level device-agnostic language (e.g., C/C++) in a straightforward manner.…”
Section: Performance Portability by Modular Software Design
confidence: 99%
“…As far as we know, the only work that addresses co-execution with oneAPI is [37]. The authors extended the Intel TBB parallel_for function to allow simultaneous execution of the same kernel on CPU and GPU.…”
Section: Related Work
confidence: 99%
“…Moreover, oneAPI is becoming very popular. It has shown promising results in various computing fields, such as machine learning (Goli et al, 2020) or decision-making (Constantinescu et al, 2020). One of the keys to its growing impact on the heterogeneous field is the existence of optimized libraries that can be used together with oneAPI.…”
Section: Introduction
confidence: 99%