HePREM: A Predictable Execution Model for GPU-based Heterogeneous SoCs

Forsberg, Björn; Benini, Luca; Marongiu, Andrea

doi:10.1109/tc.2020.2980520

Cited by 11 publications

(10 citation statements)

References 34 publications

“…The relevance of this observation becomes clear when we consider the recent efforts that the research community has put into making the adoption of modern HeSoCs feasible in the context of real-time applications [7,16,37]. The stricter the real-time requirements, the more modest the adoption of parallel systems has been so far in this domain.…”

Section: Predictable Execution and Memory Underutilisation (mentioning)

confidence: 99%

“…By scheduling memory phases in a mutually exclusive manner memory contention is avoided. Originally formulated to address concurrent accesses between single-core CPU and devices with direct memory access [24], PREM has been later successfully extended to the case of multi-core CPUs [5,27] and of HeSoCs [15,16,20]. Although effective at guaranteeing predictable timing of memory accesses, PREM-like approaches greatly sacrifice memory bandwidth utilization, as bandwidth in a modern HeSoC is sized to concurrently serve multiple computing units.…”

Section: Introduction (mentioning)

confidence: 99%

Evaluating Controlled Memory Request Injection for Efficient Bandwidth Utilization and Predictable Execution in Heterogeneous SoCs

Brilli, Cavicchioli, Solieri, et al. 2022

ACM Trans. Embed. Comput. Syst.

Self Cite

High-performance embedded platforms are increasingly adopting heterogeneous systems-on-chip (HeSoC) that couple multi-core CPUs with accelerators such as GPU, FPGA or AI engines. Adopting HeSoCs in the context of real-time workloads is not immediately possible, though, as contention on shared resources like the memory hierarchy – and in particular the main memory (DRAM) – causes unpredictable latency increase. To tackle this problem, both the research community and certification authorities mandate (i) that accesses from parallel threads to the shared system resources (typically, main memory) happen in a mutually exclusive manner by design, or (ii) that per-thread bandwidth regulation is enforced. Such arbitration schemes provide timing guarantees, but make poor use of the memory bandwidth available in a modern HeSoC. Controlled Memory Request Injection (CMRI) is a recently-proposed bandwidth limitation concept that builds on top of a mutually-exclusive schedule but still allows the threads currently not entitled to access memory to use as much of the unused bandwidth as possible without losing the timing guarantee. CMRI has been discussed in the context of a multi-core CPU, but the same principle applies also to a more complex system such as an HeSoC. In this paper we introduce two CMRI schemes suitable for HeSoCs: Voluntary Throttling via code refactoring and Bandwidth Regulation via dynamic throttling. We extensively characterize a proof-of-concept incarnation of both schemes on two HeSoCs: an NVIDIA Tegra TX2 and a Xilinx UltraScale+, highlighting the benefits and the costs of CMRI for synthetic workloads that model worst-case DRAM access. We also test the effectiveness of CMRI with real benchmarks, studying the effect of interference among the host CPU and the accelerators.
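The mutually exclusive arbitration mentioned in the abstract can be illustrated with a minimal sketch. It assumes each task has exactly one memory phase followed by one compute phase, and that a single hypothetical "memory token" is handed from task to task in list order. The function name, the task tuple layout, and the time units are illustrative assumptions, not taken from the cited papers.

```python
def schedule_memory_phases(tasks):
    """Serialize memory phases so that no two of them overlap.

    tasks: list of (name, mem_time, compute_time) tuples (hypothetical layout).
    Returns {name: (mem_start, mem_end, compute_end)}.
    """
    timeline = {}
    token_free_at = 0  # time at which the single memory token becomes free
    for name, mem_time, compute_time in tasks:
        mem_start = token_free_at            # wait for exclusive memory access
        mem_end = mem_start + mem_time       # memory phase runs without contention
        compute_end = mem_end + compute_time # compute phase touches only local data
        timeline[name] = (mem_start, mem_end, compute_end)
        token_free_at = mem_end              # pass the token to the next task
    return timeline


if __name__ == "__main__":
    for name, slots in schedule_memory_phases([("cpu", 2, 5), ("gpu", 3, 4)]).items():
        print(name, slots)
```

In this sketch the compute phases of different tasks may overlap freely, while memory is idle whenever no task holds the token. That idle bandwidth is what the abstract describes CMRI as trying to reclaim.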

“…The AutoDMA plugin is also able to perform loop tiling to extract segments of code whose memory footprint is small enough to fit in the local memory. The AutoDMA plugin is an extension of HePREM [28], originally envisioned for transforming real-time GPU code to be less sensitive to memory interference. This was achieved by transforming GPU kernels into a series of load, execute, and store phases, with explicit synchronization points between them.…”

Section: Ease Of Programming and Code Portability (mentioning)

confidence: 99%

HEROv2: Full-Stack Open-Source Research Platform for Heterogeneous Computing

Kurth, Forsberg, Benini 2022

Preprint

Self Cite

Heterogeneous computers integrate general-purpose host processors with domain-specific accelerators to combine versatility with efficiency and high performance. To realize the full potential of heterogeneous computers, however, many hardware and software design challenges have to be overcome. While architectural and system simulators can be used to analyze heterogeneous computers, they are faced with unavoidable compromises between simulation speed and performance modeling accuracy. In this work we present HEROv2, an FPGA-based research platform that enables accurate and fast exploration of heterogeneous computers consisting of accelerators based on clusters of 32-bit RISC-V cores and an application-class 64-bit ARMv8 or RV64 host processor. HEROv2 allows to seamlessly share data between 64-bit hosts and 32-bit accelerators and comes with a fully open-source on-chip network, a unified heterogeneous programming interface, and a mixed-data-model, mixed-ISA heterogeneous compiler based on LLVM. We evaluate HEROv2 in four case studies from the application level over toolchain and system architecture down to accelerator microarchitecture. We demonstrate how HEROv2 enables effective research and development on the full stack of heterogeneous computing. For instance, the compiler can tile loops and infer data transfers to and from the accelerators, which leads to a speedup of up to 4.4× compared to the original program and in most cases is only 15 % slower than a handwritten implementation, which requires 2.6× more code.
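The load, execute, and store phases described in the citation statement, combined with loop tiling, can be sketched roughly as follows. This is a plain-Python illustration, not the AutoDMA or HePREM implementation: `run_tiled`, the list standing in for main memory, and the `local` buffer standing in for accelerator-local memory are all hypothetical names chosen for the example.

```python
def run_tiled(data, tile_size, compute):
    """Apply `compute` element-wise, one tile at a time, in three phases.

    Returns (output, phases), where `phases` is a trace of
    (phase_name, start, end) tuples in the order the phases ran.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    out = [None] * len(data)
    phases = []
    for start in range(0, len(data), tile_size):
        end = min(start + tile_size, len(data))
        # Load phase: copy one tile from "main memory" into the local buffer.
        local = list(data[start:end])
        phases.append(("load", start, end))
        # Execute phase: compute using the local buffer only.
        local = [compute(x) for x in local]
        phases.append(("execute", start, end))
        # Store phase: write the results back to "main memory".
        out[start:end] = local
        phases.append(("store", start, end))
    return out, phases


if __name__ == "__main__":
    result, trace = run_tiled(list(range(10)), 4, lambda x: x * x)
    print(result)
    print(trace)
```

The boundaries between phases are where a real system would place the explicit synchronization points, so that only the load and store phases need to be scheduled against other memory users.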

“…Worst-case execution time analysis: In recent years, works on worst-case execution time (WCET) analysis for GPU programs has gained attention (Betts and Donaldson, 2013;Berezovskyi et al, 2014;Forsberg et al, 2020).…”

Section: Related Work (mentioning)

confidence: 99%

Performance and Security Analysis for GPU-Based Applications

Horga

2022

Linköping Studies in Science and Technology. Dissertations

Graphics processing units (GPUs) were originally hardware platforms for accelerating the rendering of graphics on, for example, computer screens. Over time, GPUs have become increasingly good at processing image data, in terms of both speed and image size, while their power consumption has decreased. These properties have made GPUs attractive in various domains for speeding up computation and data processing on types of data other than images as well. Applications in which GPUs are used for purposes other than generating graphics are usually referred to as GPGPU (general-purpose computing on graphics processing units). GPGPU is used today in a range of domains, for example in avionics, in cars, and even in healthcare. These new areas of use, however, bring new requirements on hardware and software with regard to, for example, performance and security. In this thesis we propose solutions for handling such requirements. We present various software tools and techniques for analysing GPGPU software, and show how our proposed solutions can be used to detect performance bottlenecks. We also show how various properties of such GPGPU software can be measured in order to handle different types of security problems.
