Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis

Licht, Johannes de Fine; Kwasniewski, Grzegorz; Hoefler, Torsten

doi:10.1145/3373087.3375296

Cited by 41 publications

(28 citation statements)

References 31 publications


“…In [14] the authors propose a model to optimize matrix multiplication for FPGA platforms by maximizing performance (computations) and minimizing off-chip I/O accesses. They apply their model to a particular implementation in FPGA using HLS obtaining competitive performance while maintaining high levels of abstraction in the code that allows portability between platforms.…”

Section: Related Work (mentioning)

confidence: 99%


Energy-efficient algebra kernels in FPGA for High Performance Computing

Favaro, Dufrechou, Ezzatti, et al. 2021. JCS&T


The dissemination of multi-core architectures and the later irruption of massively parallel devices led to a revolution in High-Performance Computing (HPC) platforms in recent decades. As a result, Field-Programmable Gate Arrays (FPGAs) are re-emerging as a versatile and more energy-efficient alternative to other platforms. Traditional FPGA design implies using low-level Hardware Description Languages (HDL) such as VHDL or Verilog, which follow an entirely different programming model than standard software languages, and their use requires specialized knowledge of the underlying hardware. In recent years, manufacturers have made significant efforts to provide High-Level Synthesis (HLS) tools in order to allow greater adoption of FPGAs in the HPC community. Our work studies the use of multi-core hardware and different FPGAs to address Numerical Linear Algebra (NLA) kernels such as general matrix multiplication (GEMM) and sparse matrix-vector multiplication (SpMV). Specifically, we compare the behavior of fine-tuned kernels on a multi-core CPU with HLS implementations on FPGAs. We perform the experimental evaluation of our implementations on a low-end and a cutting-edge FPGA platform, in terms of runtime and energy consumption, and compare the results against the Intel MKL library on CPU.


Section: Related Work (mentioning)

confidence: 99%

“…For the Xilinx's Data Center platform we tested the GEMM implementation from [14] and developed our version of SPMV based on the work in [18].…”

Section: High-end FPGA (mentioning)

confidence: 99%

Energy-efficient algebra kernels in FPGA for High Performance Computing

Favaro, Dufrechou, Ezzatti, et al. 2021. JCS&T


“…A very recent paper (de Fine Licht et al, 2019) investigates a high-level synthesis on the FPGA platform. The authors propose a model to optimize the Matrix Matrix Multiplication (MMM) algorithm.…”

Section: Related Work (mentioning)

confidence: 99%

CFD code adaptation to the FPGA architecture

Rojek, Halbiniak, Kuczynski. 2020. The International Journal of High Performance Computing Applica


In recent years, we have observed intensive development of accelerated computing platforms. Although current trends indicate a well-established position of GPU devices in the HPC environment, the FPGA (Field-Programmable Gate Array) aspires to be an alternative solution for offloading CPU computation. This paper presents a systematic adaptation of four different CFD (Computational Fluid Dynamics) kernels to the Xilinx Alveo U250 FPGA. The goal of this paper is to investigate the potential of the FPGA architecture as a future infrastructure able to run the most complex numerical simulations in the area of fluid flow modeling. The selected kernels are customized to a real scientific scenario, compatible with the EULAG (Eulerian/semi-Lagrangian) fluid solver. The solver is used to simulate thermo-fluid flows across a wide range of scales and is extensively used in numerical weather prediction. The proposed adaptation is focused on the analysis of the strengths and weaknesses of the FPGA accelerator, considering performance and energy efficiency. The proposed adaptation is compared with a CPU implementation that was strongly optimized to provide realistic and objective benchmarks. The performance results are compared with a set of server CPUs from various Intel generations, including Intel SkyLake-based CPUs such as the Xeon Gold 6148 and Xeon Platinum 8168, as well as the Intel Xeon E5-2695 CPU based on the IvyBridge architecture. Since all the kernels belong to the group of memory-bound algorithms, our main challenge is to saturate global memory bandwidth and provide data locality through intensive BRAM (Block RAM) reuse. Our adaptation allows us to reduce the performance per watt up to 80% compared to the CPUs.


“…• Challenge 3 -How to design a general-purpose accelerator which does not need to be rerun the time-consuming flow of synthesis/place/route. While many accelerators have been designed for boosting computing performance and efficiency in many application domains such as deep learning [5, 11, 12, 23, 31, 35, 64-69, 77, 87, 88], dense linear algebra [23,29,30,35,77], graph processing [4,17,25,26,39,70,89,91,92,95], genomic and bio analysis [8,9,13,14,33,38,51,76,81], data sorting [10,52,60,63], most are designed for one specific problem with fixed input and output size. For FPGA accelerators even with improved tools such as [17,77], a new design will still consume many hours or even a few days due to long synthesis and place/route time.…”

Section: Introduction (mentioning)

confidence: 99%

Sextans: A Streaming Accelerator for General-Purpose Sparse-Matrix Dense-Matrix Multiplication

Song, Chi, Sohrabizadeh, et al. 2022. Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays


Sparse-Matrix Dense-Matrix multiplication (SpMM) is the key operator for a wide range of applications including scientific computing, graph processing, and deep learning. Architecting accelerators for SpMM faces three challenges: (1) random memory accesses and unbalanced processing load caused by the random distribution of elements in sparse matrices, (2) inefficient data handling of large matrices which cannot fit on-chip, and (3) a non-general-purpose accelerator design where one accelerator can only process a fixed-size problem. In this paper, we present Sextans, an accelerator for general-purpose SpMM processing. The Sextans accelerator features (1) fast random access using on-chip memory, (2) streaming access to large off-chip matrices, (3) PE-aware non-zero scheduling for balanced workload with an II=1 pipeline, and (4) hardware flexibility to enable prototyping the hardware once to support SpMMs of different sizes as a general-purpose accelerator. We leverage high bandwidth memory (HBM) for efficient access to both sparse and dense matrices. In the evaluation, we present an FPGA prototype, Sextans, which is executable on a Xilinx U280 HBM FPGA board, and a projected prototype, Sextans-P, with higher bandwidth competitive to V100 and more frequency optimization. We conduct a comprehensive evaluation on 1,400 SpMMs on a wide range of sparse matrices including 50 matrices from SNAP and 150 from SuiteSparse. We compare Sextans with NVIDIA K80 and V100 GPUs. Sextans achieves a 2.50x geomean speedup over K80 GPU and Sextans-P achieves a 1.14x geomean speedup over V100 GPU (4.94x over K80). The code is available at https://github.com/linghaosong/Sextans.
