The Memory Controller Wall: Benchmarking the Intel FPGA SDK for OpenCL Memory Interface

Zohouri, Hamid Reza; Matsuoka, Satoshi

doi:10.1109/h2rc49586.2019.00007

Cited by 15 publications

(5 citation statements)

References 9 publications


“…First, benchmarking traditional memory on FPGAs. Previous work [13], [14], [15] tries to benchmark traditional memory, e.g., DDR3, on the FPGA by using high-level languages, e.g., OpenCL. In contrast, we benchmark HBM on the state-of-the-art FPGA.…”

Section: Related Work (mentioning)

confidence: 99%

Shuhai: Benchmarking High Bandwidth Memory On FPGAs

Wang, Huang, Zhang, et al. 2020

2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)


FPGAs are starting to be enhanced with High Bandwidth Memory (HBM) as a way to reduce the memory bandwidth bottleneck encountered in some applications and to give the FPGA more capacity to deal with application state. However, the performance characteristics of HBM are still not well specified, especially in the context of FPGAs. In this paper, we bridge the gap between nominal specifications and actual performance by benchmarking HBM on a state-of-the-art FPGA, i.e., a Xilinx Alveo U280 featuring a two-stack HBM subsystem. To this end, we propose Shuhai, a benchmarking tool that allows us to demystify all the underlying details of HBM on an FPGA. FPGA-based benchmarking should also provide a more accurate picture of HBM than doing so on CPUs/GPUs, since CPUs/GPUs are noisier systems due to their complex control logic and cache hierarchy. Since the memory itself is complex, leveraging custom hardware logic to benchmark inside an FPGA provides more details as well as accurate and deterministic measurements. We observe that 1) HBM is able to provide up to 425 GB/s memory bandwidth, and 2) how HBM is used has a significant impact on performance, which in turn demonstrates the importance of unveiling the performance characteristics of HBM so as to select the best approach. Shuhai can be easily generalized to other FPGA boards or other generations of memory, e.g., HBM3, and DDR3. We will make Shuhai open-source, benefiting the community.



“…In HPC, the memory wall is one of the main limitations of FPGAs for applications. The memory requires a controller to reorder requests to minimize row conflicts, and as a consequence the throughput depends on memory controller implementation [15], [16]. The behaviour of memory controllers is often overlooked [17], [18] or simplified as in the performance model proposed by Wang et al [8] for Intel OpenCL SDK.…”

Section: State Of the Art (mentioning)

confidence: 99%

“…FlexCL improves models covering memory access patterns with a short CPU/GPU execution, but it continues being the main source of error of the model. As some comparisons show, the memory controller makes differences in the access pattern and hence performance [15], [19]-[21]; moreover, CPU/GPU devices have a more sophisticated memory hierarchy that can hide DRAM latency. As well as the memory controller, the memory standard or technology changes the interaction with the FPGA pipeline.…”

Section: State Of the Art (mentioning)

confidence: 99%

Analytical Model for Memory-Centric High Level Synthesis-Generated Applications

Davila-Raigoza, Tejero, Villarroya-Gaudó, et al. 2021

IEEE Trans. Comput.


High performance computing (HPC) demands huge memory bandwidth and computing resources to achieve maximum performance and energy efficiency. FPGAs can provide both, and with the help of High Level Synthesis, those HPC applications can be easily written in high level languages. However, the optimization process remains time-consuming, especially when based on trial-and-error bitstream generation. Model-based performance prediction is a practical and fast approach for kernel optimization, especially if done with information from pre-synthesis reports. This article presents an analytical model focused on memory intensive applications that captures the memory behavior and accurately predicts the kernel execution time within seconds rather than hours, as bitstream generation requires. The model has been validated with two DRAM technologies: DDR4 and HBM2, with a set of microbenchmarks and high performance computing applications showing an average error of 11% for DDR4 and 10% for HBM2. Compared with previous studies, our predictions at least halve the estimation error.
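The citation statements above note that a memory controller reorders requests to minimize row conflicts. A minimal sketch of why that matters, assuming a single DRAM bank with an open-page policy: the function name `count_row_conflicts`, the 1024-byte row size, and the request addresses are all hypothetical and chosen for illustration only. This is a toy model, not the Intel FPGA SDK for OpenCL controller or the analytical model of the cited paper.

```python
def count_row_conflicts(addresses, row_size=1024):
    """Count row conflicts for one bank that keeps the last-used row open.

    A conflict is counted whenever a request targets a different row
    than the one currently open (hypothetical toy model).
    """
    open_row = None
    conflicts = 0
    for addr in addresses:
        row = addr // row_size
        if open_row is not None and row != open_row:
            conflicts += 1
        open_row = row
    return conflicts


# Hypothetical request stream that alternates between row 0 and row 4.
requests = [0, 4096, 64, 4160, 128, 4224, 192, 4288]

# Served in arrival order, every request after the first switches rows.
in_order = count_row_conflicts(requests)

# Grouping requests by row (an idealized, unbounded reordering window)
# leaves a single row switch.
reordered = count_row_conflicts(sorted(requests, key=lambda a: a // 1024))

print(in_order, reordered)
```

A real controller reorders only within a bounded queue and must also respect bank timing and ordering constraints, so the benefit depends on the controller implementation, which is the point the citing authors make.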


“…II. RELATED WORKS Several research works have investigated FPGAs performance when used as hardware accelerators, mostly using synthetic benchmarks to estimate the bandwidth of off-chip memories [26], [27], [28], and OpenCL kernels to measure the FPGA computing performance [29], [30], [31]. However, only few tools utilize the Roofline Model, and none assess also the on-chip memories bandwidth. In [26] is presented the Shuhai Verilog benchmark, used to characterize the performance of HBM and DDR off-chip memories embedded in the Xilinx Alveo U280.…”

mentioning

confidence: 99%

FER: A Benchmark for the Roofline Analysis of FPGA Based HPC Accelerators

Calore, Schifano 2022

IEEE Access


Nowadays, the use of hardware accelerators to boost the performance of HPC applications is a consolidated practice, and among others, GPUs are by far the most widespread. More recently, some data centers have successfully deployed also FPGA accelerated systems, especially to boost machine learning inference algorithms. Given the growing use of machine learning methods in various computational fields, and the increasing interest towards reconfigurable architectures, we may expect that in the near future FPGA based accelerators will be more common in HPC systems, and that they could be exploited also to accelerate general purpose HPC workloads. In view of this, tools able to benchmark FPGAs in the context of HPC are necessary for code developers to estimate the performance of applications, as well as for computer architects to model that of systems at scale. To fulfill these needs, we have developed FER (FPGA Empirical Roofline), a benchmarking tool able to empirically measure the computing performance of FPGA based accelerators, as well as the bandwidth of their on-chip and off-chip memories. FER measurements enable to draw Roofline plots for FPGAs, allowing for performance comparisons with other processors, such as CPUs and GPUs, and to estimate at the same time the performance upper-bounds that applications could achieve on a target device. In this paper we describe the theoretical model on which FER relies, its implementation details, and the results measured on Xilinx Alveo accelerator cards.

… 134 = 536 GFLOP/s, (9)

resulting approximately 20% higher with respect to the maximum performance we measured empirically with FER, and reported in Fig. 3.

Concerning the on-chip memories, such as URAMs, we can use a similar approach to estimate their maximum bandwidth. Using the conservative values suggested by Xilinx best practices, in this case 300 MHz of clock frequency and 80% as utilization factor, Eq. 4 gives:

$$B_{\mathrm{uram}} = 300\ \mathrm{MHz} \times 64\ \mathrm{bit} \times 2 \times 1280 \times 0.8 = 4.91\ \mathrm{TB/s}, \qquad (10)$$

where 1280 is the amount of available dual-port (thus we multiply by 2 their number) URAMs. Each URAM block is 72 bits wide, but with ECC (Error Correction Code) enabled it offers 64 bits wide protected data words. In this paper we always consider ECC to be enabled. The maximum bandwidth would be 6.1 TB/s with a 100% utilization.

Concerning the off-chip memory bandwidth, assuming that this is not limited by the user design, the maximum value estimated by Eq. 6 for the 4 DDR4 banks results in: … does not impact on the local memory performance.

C. CROSS-ARCHITECTURAL COMPARISON

Using the DP-FP FMAs as main mathematical operation, for which the floating point accuracy is granted to be compliant with the IEEE-754 standard [53], we can also use FER results to compare FPGAs with commodity processors. In Fig. 6 we compare the Roofline plots of U50, U250 and U280 FPGAs, with that of Intel Xeon Gold 6130 (based on Skylake micro-architecture) measure...
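The URAM estimate in Eq. 10 of the excerpt above can be checked with a few lines of arithmetic. This is a sketch, not FER code: the function name `uram_bandwidth_tbps` and its parameter names are hypothetical, and it assumes decimal terabytes (10^12 bytes) and 8 bits per byte. The input values (300 MHz, 64-bit words, 2 ports, 1280 blocks, 80% utilization) are the ones quoted in the excerpt.

```python
def uram_bandwidth_tbps(clock_hz, word_bits, ports, blocks, utilization):
    """Aggregate on-chip bandwidth in TB/s (10**12 bytes per second).

    Hypothetical helper: multiplies the factors of the excerpt's Eq. 10
    and converts bits per second to terabytes per second.
    """
    bits_per_second = clock_hz * word_bits * ports * blocks * utilization
    return bits_per_second / 8 / 1e12


# Conservative case from the excerpt: 300 MHz clock, 80% utilization.
conservative = uram_bandwidth_tbps(300e6, 64, 2, 1280, 0.8)

# Same configuration at 100% utilization.
peak = uram_bandwidth_tbps(300e6, 64, 2, 1280, 1.0)

print(f"{conservative:.4f} TB/s at 80% utilization")
print(f"{peak:.4f} TB/s at 100% utilization")
```

Under these assumptions the two results are 4.9152 TB/s and 6.144 TB/s, in line with the 4.91 TB/s and 6.1 TB/s figures quoted in the excerpt.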

