Demystifying the Soft and Hardened Memory Systems of Modern FPGAs for Software Programmers through Microbenchmarking

Lu, Alec; Fang, Zhenman; Shannon, Lesley

doi:10.1145/3517131

Cited by 3 publications

(1 citation statement)

References 30 publications


“…Thus, for each variable in an unsatisfied clause (line 8), approximately O × K × 4 B of data will be read from DRAM. Considering that the DRAM channels in Alveo families have an initial access latency of 110 ns and a peak read bandwidth of 17.9 GB/s [27], the time required to read the clause and the literal indices can be approximated as (110 ns + (O × K × 4 B) / 17.9 GB/s). Then the time taken for a flip can be estimated by considering that the loop in line 8 iterates K times and that there are 4 DRAM […] Based on the estimation model described above, Table 5 presents the FPGA-only throughput comparison between FYalSAT and the conventional WalkSAT FPGA accelerator architectures [14], [16], [17].…”

Section: Throughput and Resource Consumption (mentioning)

confidence: 99%
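The read-time approximation in the quoted statement can be worked through numerically. The sketch below simply evaluates 110 ns + (O × K × 4 B) / 17.9 GB/s; it is an illustration, not code from the citing paper. The excerpt does not define O and K, so they are treated here as plain integer inputs, the function name `dram_read_time_ns` is hypothetical, and 1 GB is assumed to mean 10^9 bytes (so 1 GB/s equals 1 byte per nanosecond).

```python
# Minimal sketch of the DRAM read-time approximation quoted above.
# Constants come from the quoted statement; the function name and
# the GB = 10^9 bytes convention are assumptions.

INITIAL_LATENCY_NS = 110.0   # initial DRAM access latency (ns)
PEAK_READ_BW_GB_S = 17.9     # peak DRAM read bandwidth (GB/s)
BYTES_PER_ENTRY = 4          # the "4 B" factor in the quote


def dram_read_time_ns(o: int, k: int) -> float:
    """Approximate time in ns to read O x K x 4 bytes from DRAM."""
    bytes_read = o * k * BYTES_PER_ENTRY
    # With GB = 10^9 bytes, 1 GB/s is exactly 1 byte/ns,
    # so bytes divided by GB/s gives nanoseconds directly.
    transfer_ns = bytes_read / PEAK_READ_BW_GB_S
    return INITIAL_LATENCY_NS + transfer_ns


if __name__ == "__main__":
    # Hypothetical values of O and K, chosen only for illustration.
    for o, k in [(1, 3), (16, 7), (179, 25)]:
        print(f"O={o}, K={k}: {dram_read_time_ns(o, k):.1f} ns")
```

For small O × K the fixed 110 ns latency dominates the estimate, which is consistent with the citing paper's interest in techniques such as clause prefetching.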

FYalSAT: High-Throughput Stochastic Local Search K-SAT Solver on FPGA

Choi, Kim. 2024. IEEE Access.


The satisfiability (SAT) problem is a fundamental challenge in computing and has a broad range of applications. This problem is NP-complete, and many algorithmic and architectural improvements have aimed at accelerating SAT solvers. However, most existing stochastic local search (SLS) hardware solvers still rely on the outdated WalkSAT algorithm, and their performance degrades when handling problems with a large number of literals per clause. In this paper, we present FYalSAT, a field-programmable gate array (FPGA) based SLS SAT solver designed for high throughput. We incorporate a conflict-free data rearrangement scheme and a novel synchronization method to increase the parallelism. We also apply various optimizations such as clause prefetching, module overlapping, and pipelining to improve the performance. Experimental results demonstrate that FYalSAT outperforms the throughput of existing SLS FPGA solvers by 9.07×-110× for benchmarks with a large number of literals per clause.

Index terms: field-programmable gate arrays, satisfiability problem, stochastic local search, accelerator architecture


CHIP-KNNv2: A Configurable and High-Performance K-Nearest Neighbors Accelerator on HBM-based FPGAs

Liu, Lu, Samtani et al. 2023. ACM Trans. Reconfigurable Technol. Syst.


The k-nearest neighbors (KNN) algorithm is an essential algorithm in many applications, such as similarity search, image classification, and database query. With the rapid growth in the dataset size and the feature dimension of each data point, processing KNN becomes more compute and memory hungry. Most prior studies focus on accelerating the computation of KNN using the abundant parallel resource on FPGAs. However, they often overlook the memory access optimizations on FPGA platforms and only achieve a marginal speedup over a multi-thread CPU implementation for large datasets. In this paper, we design and implement CHIP-KNN: an HLS-based, configurable, and high-performance KNN accelerator. CHIP-KNN optimizes the off-chip memory access on modern HBM-based FPGAs such as the AMD/Xilinx Alveo U280 FPGA board. CHIP-KNN is configurable for all essential parameters used in the algorithm, including the size of the search dataset, the feature dimension and data type representation of each data point, the distance metric, and the number of nearest neighbors - K. In terms of design architecture, we explore and discuss the trade-offs between two design versions: CHIP-KNNv1 (Ping-Pong buffer based) and CHIP-KNNv2 (streaming-based). Moreover, we investigate the routing congestion issue in our accelerator design, implement hierarchical structures to shorten critical paths, and integrate an open-source floorplanning optimization tool called TAPA/AutoBridge to eliminate the place-and-route issues. To explore the design space and balance the computation and memory access performance, we also build an analytical performance model. Given a user configuration of the KNN parameters, our tool can automatically generate TAPA HLS C code for the optimal accelerator design and the corresponding host code, on the HBM-based FPGA platform. 
Our experimental results on the Alveo U280 show that, compared to a 48-thread CPU implementation, CHIP-KNNv2 achieves a geomean performance speedup of 15x, with a maximum speedup of 45x. Additionally, we show that CHIP-KNNv2 achieves up to 2.1x performance speedup over CHIP-KNNv1 while increasing configurability. Compared with the state-of-the-art Facebook AI Similarity Search (FAISS) [23] GPU implementation running on an NVIDIA Tesla V100 GPU, CHIP-KNNv2 achieves an average latency reduction of 30.6x while requiring 34.3% of the GPU's power consumption.


PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs

Khatti, Tian, Sedigh Baroughi et al. 2024. ACM Trans. Reconfigurable Technol. Syst.


In recent years, the adoption of FPGAs in datacenters has increased, with a growing number of users choosing High-Level Synthesis (HLS) as their preferred programming method. While HLS simplifies FPGA programming, one notable challenge arises when scaling up designs for modern datacenter FPGAs that comprise multiple dies. The extra delays introduced due to die crossings and routing congestion can significantly degrade the frequency of large designs on these FPGA boards. Due to the gap between HLS design and physical design, it is challenging for HLS programmers to analyze and identify the root causes, and fix their HLS design to achieve better timing closure. Recent efforts have aimed to address these issues by employing coarse-grained floorplanning and pipelining strategies on task-parallel HLS designs where multiple tasks run concurrently and communicate through FIFO stream channels. However, many applications are not streaming friendly and many existing accelerator designs heavily rely on buffer channel based communication between tasks. In this work, we take a step further to support a task-parallel programming model where tasks can communicate via both FIFO stream channels and buffer channels. To achieve this goal, we design and implement the PASTA framework, which takes a large task-parallel HLS design as input and automatically generates a high-frequency FPGA accelerator via HLS and physical design co-optimization. Our framework introduces a latency-insensitive buffer channel design, which supports memory partitioning and ping-pong buffering while remaining compatible with vendor HLS tools. On the frontend, we provide an easy-to-use programming model for utilizing the proposed buffer channel; while on the backend, we implement efficient placement and pipelining strategies for the proposed buffer channel. 
To validate the effectiveness of our framework, we test it on four widely used Rodinia HLS benchmarks and two real-world accelerator designs and show an average frequency improvement of 25%, with peak improvements of up to 89% on AMD/Xilinx Alveo U280 boards compared to Vitis HLS baselines.
