Demystifying the Memory System of Modern Datacenter FPGAs for Software Programmers through Microbenchmarking

Self Cite

Both modern datacenter and embedded FPGAs provide great opportunities for high-performance and high energy-efficiency computing. With the growing public availability of FPGAs from major cloud service providers such as AWS, Alibaba, and Nimbix, as well as uniform hardware accelerator development tools (such as Xilinx Vitis and Intel oneAPI) for software programmers, hardware and software developers can now easily access FPGA platforms. However, it is nontrivial to develop efficient FPGA accelerators, especially for software programmers who use high-level synthesis (HLS). The major goal of this paper is to determine how to efficiently access the memory system of modern datacenter and embedded FPGAs in HLS-based accelerator designs. This is especially important for memory-bound applications; for example, a naive accelerator design utilizes less than 5% of the available off-chip memory bandwidth. To achieve our goal, we first identify a comprehensive set of factors that affect the memory bandwidth, including 1) the clock frequency of the accelerator design, 2) the number of concurrent memory access ports, 3) the data width of each port, 4) the maximum burst access length for each port, and 5) the size of consecutive data accesses. Then we carefully design a set of HLS-based microbenchmarks to quantitatively evaluate the performance of the memory systems of datacenter FPGAs (Xilinx Alveo U200 and U280) and an embedded FPGA (Xilinx ZCU104) as those factors change, and provide insights into efficient memory access in HLS-based accelerator designs. Comparing the soft and hardened memory systems typically found on datacenter and embedded FPGAs, respectively, we further summarize their unique features and discuss effective approaches to leverage these systems.
To demonstrate the usefulness of our insights, we also conduct two case studies to accelerate the widely used K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV) algorithms on datacenter FPGAs with a soft (and thus more flexible) memory system. Compared to the baseline designs, optimized designs leveraging our insights achieve about 3.5x and 8.5x speedups for the KNN and SpMV accelerators. Our final optimized KNN and SpMV designs on a Xilinx Alveo U200 FPGA fully utilize its off-chip memory bandwidth, and achieve about 5.6x and 3.4x speedups over the 24-core CPU implementations.
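
The first three factors listed in the abstract (clock frequency, number of ports, port width) set a simple ceiling on how much data an accelerator can request per second. The sketch below is a back-of-the-envelope model only, not the paper's microbenchmark: the function names and the example numbers (300 MHz, 32-bit and 512-bit ports, a 19.2 GB/s memory peak) are illustrative assumptions, and the model ignores burst length and consecutive access size, which reduce the achieved bandwidth further.

```python
def port_side_bandwidth_gbps(freq_mhz: float, num_ports: int, width_bits: int) -> float:
    """Upper bound on data the accelerator ports can move, in GB/s.

    Assumes every port transfers one full-width word per clock cycle,
    which is an idealization (hypothetical helper, not a vendor API).
    """
    bytes_per_cycle = num_ports * width_bits / 8
    return freq_mhz * 1e6 * bytes_per_cycle / 1e9


def bandwidth_ceiling_gbps(freq_mhz: float, num_ports: int, width_bits: int,
                           memory_peak_gbps: float) -> float:
    """Achievable bandwidth cannot exceed either the port side or the memory side."""
    return min(port_side_bandwidth_gbps(freq_mhz, num_ports, width_bits),
               memory_peak_gbps)


if __name__ == "__main__":
    # Assumed example values; substitute the numbers for a real board.
    memory_peak = 19.2
    for ports, width in [(1, 32), (1, 512), (4, 512)]:
        ceiling = bandwidth_ceiling_gbps(300, ports, width, memory_peak)
        print(f"{ports} port(s) x {width} bits @ 300 MHz -> ceiling {ceiling:.2f} GB/s")
```

Under these assumed numbers, a single narrow port caps the design far below the memory peak, which is consistent with the abstract's point that a naive design uses only a small fraction of the available bandwidth.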

Section: Methods (mentioning)

confidence: 82%

Section: Design Challenges and Solutions (mentioning)

confidence: 99%

Section: Introduction (mentioning)

confidence: 99%


Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking

Fang

Shannon

2022

Self Cite

“…By integrating it into a POWER9 server, the energy consumption has been reduced by 29× comparing to the CPU-only system. To make the advantage of FPGAs+HBM more accessible to software developers, researchers have proposed HLS-based optimizations for fully utilizing the HBM bandwidth [41]. With this effort, the performance of HLS-based implementations is improved by 3.5× and 8.5× in the applications of K-nearest neighbors (KNN) and sparse matrix-vector multiplication (SpMV).…”

Section: A Performance-utilization Trade-off (mentioning)

confidence: 99%

Exploiting HBM on FPGAs for Data Processing

Shi

Kara

Hagleitner

et al. 2022

FPGAs are increasingly being used in data centers and the cloud due to their potential to accelerate certain workloads as well as for their architectural flexibility, since they can be used as accelerators, as smart-NICs, or as stand-alone processors. To meet the challenges posed by these new use cases, FPGAs are quickly evolving in terms of their capabilities and organization. The utilization of High Bandwidth Memory (HBM) in FPGA devices is one recent example of such a trend. In this paper, we study the potential of FPGAs equipped with HBM from a data analytics perspective. We consider three workloads common in analytics-oriented databases and implement them on an FPGA, showing in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent for linear model training. We integrate our designs into a columnar database (MonetDB) and show the trade-offs arising from the integration related to data movement and partitioning. We consider two possible configurations of the HBM, using a single-clock and a dual-clock design. With the right design, FPGA+HBM based solutions are able to surpass the highest performance provided by either a 2-socket POWER9 system or a 14-core Xeon E5 by up to 5.9x (range selection), 18.3x (hash join), and 6.1x (SGD).

“…In SyncNN, we use a hierarchical on-chip buffering technique to buffer as many weights as possible, depending on the network size and the on-chip memory size available on the FPGA board. We load the weights in a coalesced (widened bus) and burst fashion [17,40] from the off-chip memory to on-chip memory at different granularity. As shown in Figure 5, for every convolutional layer, the weights are resolved in four dimensions.…”

Section: Memory Access Optimization (mentioning)

confidence: 99%

SyncNN: Evaluating and Accelerating Spiking Neural Networks on FPGAs

Panchapakesan

Fang

Li

2022

Self Cite

Compared to conventional artificial neural networks, Spiking Neural Networks (SNNs) are more biologically plausible and require less computation due to the event-driven nature of spiking neurons. However, the default asynchronous execution of SNNs also poses great challenges to accelerating their performance on FPGAs. In this work, we present a novel synchronous approach for rate encoding based SNNs, which is more hardware friendly than conventional asynchronous approaches. We first quantitatively evaluate and mathematically prove that the proposed synchronous approach and asynchronous implementation alternatives of rate encoding based SNNs are similar in terms of inference accuracy, and we highlight the computational performance advantage of using SyncNN over the asynchronous approach. We also design and implement the SyncNN framework to accelerate SNNs on Xilinx ARM-FPGA SoCs in a synchronous fashion. To improve the computation and memory access efficiency, we first quantize the network weights to 16-bit, 8-bit, and 4-bit fixed-point values with SNN-friendly quantization techniques. Moreover, we encode only the activated neurons by recording their positions and corresponding number of spikes to fully utilize the event-driven characteristics of SNNs, instead of using the common binary encoding (i.e., 1 for a spike and 0 for no spike). For the encoded neurons that have dynamic and irregular access patterns, we design parameterized compute engines to accelerate their performance on the FPGA, where we explore various parallelization strategies and memory access optimizations. Our experimental results on multiple Xilinx ARM-FPGA SoC boards demonstrate that our SyncNN is scalable to run multiple networks, such as LeNet, Network in Network, and VGG, on various datasets such as MNIST, SVHN, and CIFAR-10. SyncNN not only achieves competitive accuracy (99.6%) but also achieves state-of-the-art performance (13,086 frames per second) for the MNIST dataset.
Finally, we compare the performance of SyncNN with conventional CNNs using Vitis AI and find that SyncNN can achieve similar accuracy and better performance compared to Vitis AI for image classification using small networks.
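
The abstract's idea of encoding only the activated neurons as (position, spike count) pairs can be illustrated with a small sketch. This is a generic sparse encoding written for illustration, assuming per-neuron spike counts are already available as a list; the function names are hypothetical and the code is not taken from the SyncNN framework.

```python
from typing import List, Tuple


def encode_active(spike_counts: List[int]) -> List[Tuple[int, int]]:
    """Keep only neurons that fired, as (position, number_of_spikes) pairs."""
    return [(pos, count) for pos, count in enumerate(spike_counts) if count > 0]


def decode_active(pairs: List[Tuple[int, int]], num_neurons: int) -> List[int]:
    """Rebuild the dense spike-count vector from the sparse pairs."""
    dense = [0] * num_neurons
    for pos, count in pairs:
        dense[pos] = count
    return dense


if __name__ == "__main__":
    # Assumed toy layer: eight neurons, three of which fired.
    counts = [0, 3, 0, 0, 1, 0, 0, 2]
    pairs = encode_active(counts)
    print(pairs)
    print(decode_active(pairs, len(counts)) == counts)
```

When few neurons fire, the pair list is much shorter than the dense vector, but the resulting accesses become data-dependent and irregular, which is the access pattern the abstract says the compute engines are designed to handle.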