2022
DOI: 10.1109/tc.2021.3075765

Shuhai: A Tool for Benchmarking High Bandwidth Memory on FPGAs

Cited by 27 publications (10 citation statements)
References 28 publications
“…A rather more interesting comparison is the number of distinct items and the size of the stream each system can handle. Using 32-bit items, as in the case of all presented accelerators, allows for streams with up to 2^32 distinct items, which is sufficient for many practical applications. This includes product IDs in market basket data, IPv4 addresses, port numbers in network traffic or hashes of larger data structures.…”
Section: FPGA Implementation (mentioning)
confidence: 99%
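As an illustration of the excerpt above, here is a minimal sketch of reducing larger records to 32-bit item identifiers via hashing; the FNV-1a hash and the example keys are assumptions for illustration, not taken from the cited accelerators.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hash an arbitrary byte string down to a 32-bit item ID (FNV-1a).
// With 32-bit IDs the item space is capped at 2^32 distinct values,
// matching the limit discussed in the excerpt above.
static uint32_t item_id32(const std::string& key) {
    uint32_t h = 2166136261u;   // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 16777619u;         // FNV prime
    }
    return h;
}

int main() {
    printf("id(\"udp:53\")   = %u\n", item_id32("udp:53"));
    printf("id(\"10.0.0.1\") = %u\n", item_id32("10.0.0.1"));
    printf("distinct item capacity = %llu\n", (1ULL << 32));
    return 0;
}
```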
“…HBM usually comes with a number of independent interfaces called channels, each of which accesses an independent set of DRAM banks. For example, the Xilinx Alveo U280 platform [67] has 32 independent HBM channels, providing up to 460 GB/s of theoretical memory bandwidth and 425 GB/s in practice [29]. This is almost 6× higher than the Alveo U250, a platform with four traditional DRAM channels.…”
Section: HBM on FPGAs (mentioning)
confidence: 99%
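A back-of-the-envelope check of the figures quoted above (32 channels, 460 GB/s theoretical, 425 GB/s measured) can be written as a small sketch; the per-channel numbers are simply derived from those totals, not taken from the HBM specification.

```cpp
#include <cstdio>

int main() {
    const int    channels       = 32;      // HBM pseudo channels on the Alveo U280
    const double peak_total_gbs = 460.0;   // theoretical aggregate bandwidth (GB/s)
    const double meas_total_gbs = 425.0;   // measured aggregate bandwidth (GB/s)

    printf("per-channel peak:     %.2f GB/s\n", peak_total_gbs / channels);
    printf("per-channel measured: %.2f GB/s\n", meas_total_gbs / channels);
    printf("efficiency:           %.1f %%\n", 100.0 * meas_total_gbs / peak_total_gbs);
    return 0;
}
```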
“…As introduced in Section 2.2, Xilinx HBM-enabled platforms have a built-in crossbar that allows memory ports of user kernels to access any pseudo channel of the HBM stack. However, concurrent accesses to the same channel significantly reduce bandwidth due to congestion and the physical limit of a single channel [16,29]. Like the original design, we access all HBM channels independently in a burst manner to maximize the memory bandwidth.…”
Section: HBM-Specific Optimizations (mentioning)
confidence: 99%
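A minimal Vitis HLS-style sketch of the access pattern described above: each kernel port is bound to its own HBM pseudo channel through a separate m_axi bundle and reads sequentially so bursts can be inferred. The port names, bundle names, and channel mapping are illustrative assumptions, not the cited design.

```cpp
#include <cstdint>

// Each m_axi bundle is mapped to a different HBM pseudo channel at link
// time (e.g. via --connectivity.sp in Vitis), so the two read loops below
// never contend for the same channel through the built-in crossbar.
extern "C" void sum_two_channels(const uint64_t* ch0, const uint64_t* ch1,
                                 uint64_t* out, int n) {
#pragma HLS INTERFACE m_axi port=ch0 bundle=gmem0 offset=slave
#pragma HLS INTERFACE m_axi port=ch1 bundle=gmem1 offset=slave
#pragma HLS INTERFACE m_axi port=out bundle=gmem0 offset=slave

    uint64_t acc0 = 0, acc1 = 0;
    // Sequential, unit-stride reads allow burst inference on each channel.
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        acc0 += ch0[i];
    }
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        acc1 += ch1[i];
    }
    out[0] = acc0 + acc1;
}
```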
“…Lastly, we model C_i^{acs_v} based on the architecture of the Vertex Loader and Ping-Pong Buffer in the Big and Little pipelines, respectively. As the Vertex Loader directly accesses memory for different requests without caching or prefetching, we benchmark the memory access latency with varying access distance (stride) on the test FPGAs [18]. The benchmark results show that the C_i^{acs_v} of the Big pipeline can be modeled by a linear function of access distance, as shown in Equation (4), with an upper bound and a lower bound, as there exist worst-case and best-case memory access latencies.…”
Section: Performance Modeling of Big-Little Pipelines (mentioning)
confidence: 99%
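The stride-latency model described above can be sketched as a pair of linear bounds; the coefficients below are placeholders for illustration, not measurements from the cited benchmark [18].

```cpp
#include <cstdio>
#include <initializer_list>

// Linear model of memory access latency (cycles) versus access distance
// (stride, in bytes): latency = a * stride + b. Two coefficient sets give
// a best-case (lower bound) and worst-case (upper bound) estimate, mirroring
// the bounded model in the excerpt. Coefficients are illustrative only.
struct LatencyModel {
    double a;  // cycles per byte of stride
    double b;  // fixed latency in cycles
    double operator()(double stride) const { return a * stride + b; }
};

int main() {
    const LatencyModel lower{0.05, 120.0};  // best-case coefficients (hypothetical)
    const LatencyModel upper{0.12, 180.0};  // worst-case coefficients (hypothetical)

    for (double stride : {64.0, 256.0, 1024.0, 4096.0}) {
        printf("stride %6.0f B: latency in [%.0f, %.0f] cycles\n",
               stride, lower(stride), upper(stride));
    }
    return 0;
}
```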