Mixed-data-model heterogeneous compilation and OpenMP offloading

Kurth, Andreas; Wolters, Koen; Forsberg, Björn; Capotondi, Alessandro; Marongiu, Andrea; Grosser, Tobias; Benini, Luca

doi:10.1145/3377555.3377891

Cited by 5 publications

(5 citation statements)

References 33 publications

“…Certain notable architectural examples that are easily accessible for prototyping are the 32-bit RI5CY microcontroller core and the 64-bit Ariane microprocessor core [21]. Furthermore, the online community offers multi-core architectures, such as OpenPULP HERO (RI5CY), and OpenPiton (Ariane) [22,23].…”

Section: RISC-V Background (mentioning)

confidence: 99%

From SW Timing Analysis and Safety Logging to HW Implementation: A Possible Solution with an Integrated and Low-Power Logger Approach

Cosimi, Arena, Gai et al. 2023. JLPEA

In this manuscript, we propose a configurable hardware device in order to build a coherent data log unit. We address the need for analyzing mixed-criticality systems, thus guaranteeing the best performances without introducing additional sources of interference. Log data are essential to inspect the behavior of running applications when safety analyses or worst-case execution time measurements are performed. Furthermore, performance and timing investigations are useful for solving scheduling issues to balance resource budgets and investigate misbehavior and failure causes. We additionally present a performance evaluation and log capabilities by means of simulations on a RISC-V use case. The simulations highlight that such a data log unit can trace the execution from a single- to an octa-core microcontroller. Such an analysis allows a silicon developer to obtain the right sizings and timings of devices during the development phase. Finally, we present an analysis of a real RISC-V implementation for a Xilinx UltraScale+ FPGA, which was obtained with Vivado 2018. The results show that our data log unit implementation does not introduce a significant area overhead if compared to the RISC-V core targeted for tests, and that the timing constraints are not violated.

“…To address mixed-data-width compilation in HEROv2, the Clang frontend has been extended to generate LLVM IR with automatically assigned address spaces. We adopt the techniques of [24], where OpenMP offloading entry points are used to infer that pointers passed to a device kernel from the host are 64-bit wide. The use of such pointers are then tracked throughout the application, such that any pointer that cannot be guaranteed to never hold a 64-bit host address is promoted to the host address space.…”

Section: Interoperability Between Host and Accelerators (mentioning)

confidence: 99%

“…Unlike in HEROv2, accelerators are not programmable with a full-featured standard ISA, and there is thus no OpenMP offloading support and no heterogeneous API, runtime libraries, and toolchain that span across host processors and accelerators. HEROv1 [19], [54] does provide the components that enable the evaluation of heterogeneous applications on a mixed-ISA computer, but its toolchain is fundamentally limited to 32-bit hosts and accelerators [24]. Additionally, it has no API that unifies programming over multiple accelerators; it features one host and one accelerator architecture, and hardware and software are tailored to those instead of being modular; and its on-chip network is limited to simple configurations (e.g., fixed 64-bit data width) and topologies (e.g., central crossbar), which do not meet the demands of modern heterogeneous computers.…”

Section: Related Work (mentioning)

confidence: 99%

“…Those extensions are only available through the proprietary Intel compiler, whereas HEROv2's full toolchain is open source. Research works on GCC [69] were the first to provide an open-source heterogeneous OpenMP toolchain, but GCC's offloading compilation is fundamentally limited to the same data model (e.g., 32-bit) for host and accelerators [24]. Mixed-data-model heterogeneous compilation has been pioneered recently [24] with Clang/LLVM, and HEROv2 integrates that work into its toolchain.…”

Section: Related Work (mentioning)

confidence: 99%

“…Application programmers usually do not notice this: The compiler generates the correct instructions for accessing pointers outside the native (32-bit physical) address space of the accelerator. In the common case, such accesses hit in the TLB of the IOMMU and incur an overhead of only three cycles per remote memory access [24]. When an access misses in the TLB, the core either invokes the VMM library itself to add an entry to the IOMMU, or it lets a dedicated core handle the misses.…”

Section: Runtime Libraries and Operating System Support (mentioning)

confidence: 99%

HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing

Kurth, Forsberg, Benini 2022. Preprint. Self-citation.

Heterogeneous computers integrate general-purpose host processors with domain-specific accelerators to combine versatility with efficiency and high performance. To realize the full potential of heterogeneous computers, however, many hardware and software design challenges have to be overcome. While architectural and system simulators can be used to analyze heterogeneous computers, they face unavoidable compromises between simulation speed and performance modeling accuracy. In this work we present HEROv2, an FPGA-based research platform that enables accurate and fast exploration of heterogeneous computers consisting of accelerators based on clusters of 32-bit RISC-V cores and an application-class 64-bit ARMv8 or RV64 host processor. HEROv2 enables seamless data sharing between 64-bit hosts and 32-bit accelerators and comes with a fully open-source on-chip network, a unified heterogeneous programming interface, and a mixed-data-model, mixed-ISA heterogeneous compiler based on LLVM. We evaluate HEROv2 in four case studies, ranging from the application level through the toolchain and system architecture down to the accelerator microarchitecture. We demonstrate how HEROv2 enables effective research and development on the full stack of heterogeneous computing. For instance, the compiler can tile loops and infer data transfers to and from the accelerators, which leads to a speedup of up to 4.4× compared to the original program and in most cases is only 15% slower than a handwritten implementation, which requires 2.6× more code.
