Coherently Attached Programmable Near-Memory Acceleration Platform and its application to Stencil Processing

Lunteren, Jan van; Luijten, Ronald P.; Diamantopoulos, Dionysios; Auernhammer, Florian; Hagleitner, Christoph; Chelini, Lorenzo; Corda, Stefano; Singh, Gagandeep

doi:10.23919/date.2019.8715088

Cited by 13 publications

(10 citation statements)

References 12 publications

“…More recently, the use of FPGAs to accelerate stencils has been proposed [9,36,37,44]. Augmenting general-purpose cores with specialized FPGA accelerators is a promising approach to enhance overall system performance.…”

Section: Related Work (mentioning)

confidence: 99%

Casper: Accelerating Stencil Computation using Near-cache Processing

Denzler, Bera, Hajinazar et al. 2021

Preprint

Self Cite

Stencil computation is one of the most used kernels in a wide variety of scientific applications, ranging from large-scale weather prediction to solving partial differential equations. Stencil computations are characterized by three unique properties: (1) low arithmetic intensity, (2) limited temporal data reuse, and (3) regular and predictable data access pattern. As a result, stencil computations are typically bandwidth-bound workloads, which only experience limited benefits from the deep cache hierarchy of modern CPUs. In this work, we propose Casper, a near-cache accelerator consisting of specialized stencil compute units connected to the last-level cache (LLC) of a traditional CPU. Casper is based on two key ideas: (1) avoiding the cost of moving rarely reused data through the cache hierarchy, and (2) exploiting the regularity of the data accesses and the inherent parallelism of the stencil computation to increase the overall performance. With minimal changes in LLC address decoding logic and data placement, Casper performs stencil computations at the peak bandwidth of the LLC. We show that, by tightly coupling lightweight stencil compute units near to LLC, Casper improves performance of stencil kernels by 1.65× on average, while reducing the energy consumption by 35% compared to a commercial high-performance multi-core processor. Moreover, Casper provides a 37× improvement in performance-per-area compared to a state-of-the-art GPU.
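
To make the "low arithmetic intensity" property concrete, here is a minimal sketch of a 5-point averaging stencil and a rough FLOPs-per-byte estimate. It is not taken from the Casper paper: the function names `jacobi_5pt` and `arithmetic_intensity`, the 0.2 weight, and the assumption of ideal cache reuse (one 8-byte read and one 8-byte write per grid point) are illustrative only.

```python
import numpy as np


def jacobi_5pt(grid: np.ndarray) -> np.ndarray:
    """Apply one sweep of a 5-point averaging stencil to the interior of a 2D grid.

    Boundary cells are copied unchanged (an assumption made for simplicity).
    """
    out = grid.copy()
    out[1:-1, 1:-1] = 0.2 * (
        grid[1:-1, 1:-1]                       # centre
        + grid[:-2, 1:-1] + grid[2:, 1:-1]     # north and south neighbours
        + grid[1:-1, :-2] + grid[1:-1, 2:]     # west and east neighbours
    )
    return out


def arithmetic_intensity(flops_per_point: int, bytes_per_point: int) -> float:
    """Ratio of floating-point operations to bytes moved, per grid point."""
    return flops_per_point / bytes_per_point


grid = np.arange(16, dtype=np.float64).reshape(4, 4)
result = jacobi_5pt(grid)

# 4 additions + 1 multiplication per point; assuming ideal reuse, each
# float64 point is read once and written once (2 * 8 bytes).
ai = arithmetic_intensity(flops_per_point=5, bytes_per_point=16)
print(f"estimated arithmetic intensity: {ai} FLOP/byte")
```

With well under one floating-point operation per byte even under ideal reuse, a kernel of this shape would be limited by memory bandwidth rather than compute on most processors, which is the behaviour the abstract describes.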

“…We include as a competitor to NMC an NVIDIA V100, one of the latest GPU with 32GB of HBM2 memory at 900 GB/s, which uses similar technology to the NMC platform. As NMC systems we use a custom hardware design called Access Processor (AP) [14], which can be mapped on different FPGAs (DDR4 and HBM2). Differently from a classical general-purpose computer, where the access bandwidth and latency depend on a complex mixture of workload characteristics and the memory hierarchy, the Access Processor (AP) design comprises the so-called memory controller, which has the feature of enabling more control over the memory system and programming all the concurrently running data streams from/to the attached NMC accelerators (see Fig 5).…”

Section: A System In Use (mentioning)

confidence: 99%

“…The AP provides fine-grained control to schedule the accesses to the DDR4 and HBM2 memory (see Fig. 10), the transfer of the data to and from the FPGAs internal SRAM (Block RAM and/or UltraRAM), and the processing of the data [14]. Because the various 1D FFTs (see Fig.…”

Section: B Offloading On NMC Systems (mentioning)

confidence: 99%

Near Memory Acceleration on High Resolution Radio Astronomy Imaging

Corda, Veenboer, Awan et al. 2020

Preprint

Self Cite

Modern radio telescopes like the Square Kilometer Array (SKA) will need to process in real-time exabytes of radio-astronomical signals to construct a high-resolution map of the sky. Near-Memory Computing (NMC) could alleviate the performance bottlenecks due to frequent memory accesses in a state-of-the-art radio-astronomy imaging algorithm. In this paper, we show that a sub-module performing a two-dimensional fast Fourier transform (2D FFT) is memory bound using CPI breakdown analysis on IBM Power9. Then, we present an NMC approach on FPGA for 2D FFT that outperforms a CPU by up to a factor of 120x and performs comparably to a high-end GPU, while using less bandwidth and memory.
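
As background on why a 2D FFT breaks down into many independent 1D FFTs, the sketch below uses the standard row-column decomposition in NumPy. It is only an illustration of that general technique, not the FPGA design described in the paper; the helper name `fft2_row_column` and the 8×8 random input are assumptions.

```python
import numpy as np


def fft2_row_column(image: np.ndarray) -> np.ndarray:
    """Compute a 2D FFT as 1D FFTs over every row, then over every column."""
    rows_done = np.fft.fft(image, axis=1)   # one 1D FFT per row
    return np.fft.fft(rows_done, axis=0)    # one 1D FFT per column


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))

spectrum = fft2_row_column(image)
# Cross-check against NumPy's built-in 2D transform.
matches = np.allclose(spectrum, np.fft.fft2(image))
print("row-column result matches np.fft.fft2:", matches)
```

The column pass touches memory with a large stride, which is one common reason this kernel tends to be memory bound on cache-based CPUs.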

“…They are similar to the kernels used in other weather and climate models [97,125,177]. Their performance is dominated by memory-bound operations with unique irregular memory access patterns and low arithmetic intensity that often results in <10% sustained floating-point performance on current CPU-based systems [165].…”

Section: Introduction (mentioning)

confidence: 99%

Accelerating Weather Prediction Using Near-Memory Reconfigurable Fabric

Singh, Diamantopoulos, Gómez-Luna et al. 2022

ACM Trans. Reconfigurable Technol. Syst.

Self Cite

Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration. To overcome these challenges, we propose and evaluate the use of near-memory acceleration using a reconfigurable fabric with high-bandwidth memory (HBM). We focus on compound stencils that are fundamental kernels in weather prediction models. By using high-level synthesis techniques, we develop NERO, an FPGA+HBM-based accelerator connected through OCAPI (Open Coherent Accelerator Processor Interface) to an IBM POWER9 host system. Our experimental results show that NERO outperforms a 16-core POWER9 system by 5.3× and 12.7× when running two different compound stencil kernels. NERO reduces the energy consumption by 12× and 35× for the same two kernels over the POWER9 system with an energy efficiency of 1.61 GFLOPS/Watt and 21.01 GFLOPS/Watt. We conclude that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.
