Deploying Hash Tables on Die-Stacked High Bandwidth Memory

Xing, Cheng; He, Bingsheng; Lo, Eric; Wang, Wei; Lu, Shengliang; Chen, Xinyu

doi:10.1145/3357384.3358015

Cited by 7 publications

(4 citation statements)

References 19 publications


“…Second, data processing with HBM/HMC. Previous work [16], [17], [18], [19], [20], [21], [22], [23] employs HBM to accelerate their applications, e.g., hash table, deep learning, and streaming, by leveraging the high memory bandwidth provided by Intel Knights Landing (KNL)'s HBM [24]. In contrast, we benchmark the performance of HBM on the Xilinx FPGA.…”

Section: Related Work

Citation type: mentioning

confidence: 99%

Shuhai: Benchmarking High Bandwidth Memory on FPGAs

Wang

Huang

Zhang

et al. 2020

2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)


FPGAs are starting to be enhanced with High Bandwidth Memory (HBM) as a way to reduce the memory bandwidth bottleneck encountered in some applications and to give the FPGA more capacity to deal with application state. However, the performance characteristics of HBM are still not well specified, especially in the context of FPGAs. In this paper, we bridge the gap between nominal specifications and actual performance by benchmarking HBM on a state-of-the-art FPGA, i.e., a Xilinx Alveo U280 featuring a two-stack HBM subsystem. To this end, we propose Shuhai, a benchmarking tool that allows us to demystify all the underlying details of HBM on an FPGA. FPGA-based benchmarking should also provide a more accurate picture of HBM than doing so on CPUs/GPUs, since CPUs/GPUs are noisier systems due to their complex control logic and cache hierarchy. Since the memory itself is complex, leveraging custom hardware logic to benchmark inside an FPGA provides more details as well as accurate and deterministic measurements. We observe that 1) HBM is able to provide up to 425 GB/s memory bandwidth, and 2) how HBM is used has a significant impact on performance, which in turn demonstrates the importance of unveiling the performance characteristics of HBM so as to select the best approach. Shuhai can be easily generalized to other FPGA boards or other generations of memory, e.g., HBM3 and DDR3. We will make Shuhai open-source, benefiting the community.


“…Multicore processors also take advantage of HBM, such as Intel's Knights Landing, NVIDIA's Titan V, and Google's TPU. Recent research in this area has focused on demonstrating the utility of HBM for data-intensive processing issues, such as hash tables [38], graph processing [39], and stream processing [40]. When it comes to the process of expediting text search, there are two obstacles related to accessing the external memory:…”

Section: High Bandwidth Memory

Citation type: mentioning

confidence: 99%

A High-Performance Non-Indexed Text Search System

Kieu-Do-Nguyen

Dang

The Binh

et al. 2024

Electronics


Full-text search has a wide range of applications, including tracking systems, computer vision, and natural language processing. Standard methods usually implement a two-phase procedure: indexing and retrieving, with the retrieval performance entirely dependent on the index efficiency. In most cases, the more powerful the index algorithm, the more memory and processing time are required. The amount of time and memory required to index a collection of documents is proportional to its overall size. In this paper, we propose a full-text search hardware implementation without the indexing phase, thus removing the time and memory requirements for indexing. Additionally, we propose an efficient design to leverage the parallel architecture of High Bandwidth Memory (HBM). To our knowledge, few (if not zero) researchers have integrated their full-text search system with an effective data access control on HBM. The functionality of the proposed system is verified on the Xilinx Alveo U50 Field-Programmable Gate Array (FPGA). The experimental results show that our system achieved a throughput of 8 Gigabytes per second, about 6697× speed-up compared to other software-based approaches.


“…KNL, being an x86 many-core architecture, offers easy portability for existing codebases and allows rapid testing of HBM-related ideas. Cheng et al. [41] focus on optimizing NUMA placement of hash tables on KNL to increase the utilization of the HBM and provide simulation results for hash join. Pohl et al. [42] focus on using the HBM on KNL for joins and find that the mode where HBM is directly addressed, as opposed to the cache mode, results in the highest performance.…”

Section: Stochastic Gradient Descent (SGD)

Citation type: mentioning

confidence: 99%

High Bandwidth Memory on FPGAs: A Data Analytics Perspective

Kara

Hagleitner

Diamantopoulos

et al. 2020

2020 30th International Conference on Field-Programmable Logic and Applications (FPL)


FPGA-based data processing in datacenters is increasing in popularity due to the demands of modern workloads and the ensuing necessity for specialization in hardware. Driven by this trend, vendors are rapidly adapting reconfigurable devices to suit data- and compute-intensive workloads. Inclusion of High Bandwidth Memory (HBM) in FPGA devices is a recent example. HBM promises to overcome the bandwidth bottleneck often faced by FPGA-based accelerators due to their throughput-oriented design. In this paper, we study the usage and benefits of HBM on FPGAs from a data analytics perspective. We consider three workloads that are often performed in analytics-oriented databases and implement them on an FPGA, showing in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent for linear model training. We integrate our designs into a columnar database (MonetDB) and show the tradeoffs arising from the integration related to data movement and partitioning. In certain cases, FPGA+HBM-based solutions are able to surpass the highest performance provided by either a 2-socket POWER9 system or a 14-core Xeon E5 by up to 1.8x (selection), 12.9x (join), and 3.2x (SGD).
