2022
DOI: 10.1109/tc.2021.3075765

Shuhai: A Tool for Benchmarking High Bandwidth Memory on FPGAs

Cited by 27 publications (10 citation statements)
References 28 publications
“…A rather more interesting comparison is the number of distinct items and the size of the stream each system can handle. Using 32-bit items, as in the case of all presented accelerators, allows for streams with up to 2^32 distinct items, which is sufficient for many practical applications. This includes product IDs in market basket data, IPv4 addresses, port numbers in network traffic or hashes of larger data structures.…”
Section: FPGA Implementation (mentioning)
confidence: 99%
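As an illustration of the excerpt above, here is a minimal sketch of reducing larger records to 32-bit item identifiers via hashing; the FNV-1a hash and the example keys are assumptions for illustration, not taken from the cited accelerators.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hash an arbitrary byte string down to a 32-bit item ID (FNV-1a).
// With 32-bit IDs the item space is capped at 2^32 distinct values,
// matching the limit discussed in the excerpt above.
static uint32_t item_id32(const std::string& key) {
    uint32_t h = 2166136261u;   // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 16777619u;         // FNV prime
    }
    return h;
}

int main() {
    printf("id(\"udp:53\")   = %u\n", item_id32("udp:53"));
    printf("id(\"10.0.0.1\") = %u\n", item_id32("10.0.0.1"));
    printf("distinct item capacity = %llu\n", (1ULL << 32));
    return 0;
}
```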
“…HBM usually comes with a number of independent interfaces called channels, each of which accesses an independent set of DRAM banks. For example, the Xilinx Alveo U280 platform [67] has 32 independent HBM channels, providing up to 460 GB/s of theoretical memory bandwidth and 425 GB/s in practice [29]. This is almost 6× higher than the Alveo U250, a platform with four traditional DRAM channels.…”
Section: HBM on FPGAs (mentioning)
confidence: 99%
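A back-of-the-envelope check of the figures quoted above (32 channels, 460 GB/s theoretical, 425 GB/s measured) can be written as a small sketch; the per-channel numbers are simply derived from those totals, not taken from the HBM specification.

```cpp
#include <cstdio>

int main() {
    const int    channels       = 32;      // HBM pseudo channels on the Alveo U280
    const double peak_total_gbs = 460.0;   // theoretical aggregate bandwidth (GB/s)
    const double meas_total_gbs = 425.0;   // measured aggregate bandwidth (GB/s)

    printf("per-channel peak:     %.2f GB/s\n", peak_total_gbs / channels);
    printf("per-channel measured: %.2f GB/s\n", meas_total_gbs / channels);
    printf("efficiency:           %.1f %%\n", 100.0 * meas_total_gbs / peak_total_gbs);
    return 0;
}
```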
“…As introduced in Section 2.2, Xilinx HBM-enabled platforms have a built-in crossbar that allows memory ports of user kernels to access any pseudo channel of the HBM stack. However, concurrent accesses to the same channel significantly reduce bandwidth due to congestion and the physical limit of a single channel [16,29]. Like the original design, we access all HBM channels independently in a burst manner to maximize the memory bandwidth.…”
Section: HBM-Specific Optimizations (mentioning)
confidence: 99%
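A minimal Vitis HLS-style sketch of the access pattern described above: each kernel port is bound to its own HBM pseudo channel through a separate m_axi bundle and reads sequentially so bursts can be inferred. The port names, bundle names, and channel mapping are illustrative assumptions, not the cited design.

```cpp
#include <cstdint>

// Each m_axi bundle is mapped to a different HBM pseudo channel at link
// time (e.g. via --connectivity.sp in Vitis), so the two read loops below
// never contend for the same channel through the built-in crossbar.
extern "C" void sum_two_channels(const uint64_t* ch0, const uint64_t* ch1,
                                 uint64_t* out, int n) {
#pragma HLS INTERFACE m_axi port=ch0 bundle=gmem0 offset=slave
#pragma HLS INTERFACE m_axi port=ch1 bundle=gmem1 offset=slave
#pragma HLS INTERFACE m_axi port=out bundle=gmem0 offset=slave

    uint64_t acc0 = 0, acc1 = 0;
    // Sequential, unit-stride reads allow burst inference on each channel.
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        acc0 += ch0[i];
    }
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        acc1 += ch1[i];
    }
    out[0] = acc0 + acc1;
}
```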
“…Lastly, we model C_i^{acs_v} based on the architecture of the Vertex Loader and Ping-Pong Buffer in the Big and Little pipelines, respectively. As the Vertex Loader directly accesses memory for different requests without caching or prefetching, we benchmark the memory access latency with varying access distance (stride) on the test FPGAs [18]. The benchmark results show that the C_i^{acs_v} of the Big pipeline can be modeled by a linear function of access distance, as shown in Equation (4), with an upper bound and a lower bound, as there exist worst-case and best-case memory access latencies.…”
Section: Performance Modeling of Big-Little Pipelines (mentioning)
confidence: 99%
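The stride-latency model described above can be sketched as a pair of linear bounds; the coefficients below are placeholders for illustration, not measurements from the cited benchmark [18].

```cpp
#include <cstdio>
#include <initializer_list>

// Linear model of memory access latency (cycles) versus access distance
// (stride, in bytes): latency = a * stride + b. Two coefficient sets give
// a best-case (lower bound) and worst-case (upper bound) estimate, mirroring
// the bounded model in the excerpt. Coefficients are illustrative only.
struct LatencyModel {
    double a;  // cycles per byte of stride
    double b;  // fixed latency in cycles
    double operator()(double stride) const { return a * stride + b; }
};

int main() {
    const LatencyModel lower{0.05, 120.0};  // best-case coefficients (hypothetical)
    const LatencyModel upper{0.12, 180.0};  // worst-case coefficients (hypothetical)

    for (double stride : {64.0, 256.0, 1024.0, 4096.0}) {
        printf("stride %6.0f B: latency in [%.0f, %.0f] cycles\n",
               stride, lower(stride), upper(stride));
    }
    return 0;
}
```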