GPU-Initiated On-Demand High-Throughput Storage Access in the BaM System Architecture

Qureshi, Zaid; Mailthody, Vikram Sharma; Gelado, Isaac; Min, Seungwon; Masood, Amna; Park, Jeong-Min; Xiong, Jinjun; Newburn, Chris J.; Vainbrand, Dmitri; Chung, I-Hsin; Garland, Michael; Dally, William J.; Hwu, Wen-mei W.

doi:10.1145/3575693.3575748

Cited by 8 publications

(4 citation statements)

References 32 publications

“…As mentioned in Section 1, the host DRAM-based method EMOGI [31] is generally faster than the SSD-based method BaM [33] in terms of graph processing time. If EMOGI's runtime includes the time for loading graph data onto the host DRAM from the SSDs (data loading time) in addition to the time for running the algorithm on the GPU (graph processing time), BaM is shown to be competitive for some benchmark workloads.…”

Section: Processing Time As Performance Metric (mentioning)

confidence: 99%

“…With 16 drives, the system well supports the required random read speed of 93.75 MIOPS. To evaluate BaM, we replace XLFDDs with NVMe SSDs that collectively offer 6-MIOPS random read performance to match the number used in [33]. As with BaM, we place submission queues (SQs) and data buffers in the base address register (BAR) section of the GPU memory in order to control storage devices directly from the GPU.…”

Section: Evaluation On Low-latency Flash Memory (mentioning)

confidence: 99%

“…This paper has shown that EMOGI stays as performant even if the external memory latency is longer than the host DRAM, up to a few microseconds. GPU graph processing on storage: BaM introduced a first GPU-centric storage access method that does not involve CPU intervention [33]. While there are several prior works in GPU-centric approaches [28,37,39,40,43], they rely on the CPU to handle storage access and use the GPU memory as a staging buffer for their data transfer.…”

Section: Related Work (mentioning)

confidence: 99%

“…In order to handle ever-growing data sizes in these applications beyond the relatively limited capacity (tens of GBs) of GPU onboard memory, the use of external memory such as the host DRAM and solid-state drives (SSDs) can be a cost-effective approach compared with pooling multiple GPUs' memory together [9-11, 18, 22, 28, 31, 33, 37, 39, 40, 43]. In particular, GPU-centric external memory access methods have been shown to yield the state-of-the-art runtime performance in workloads involving on-demand, fine-grained random access such as graph analytics [31,33]. That is, when small pieces of data to be read next depend on the current processing results and cannot be a priori determined, it is more efficient to have the GPU initiate data requests than to have the CPU control the data flow between the GPU and external memory.…”

Section: Introduction (mentioning)

confidence: 99%

GPU Graph Processing on CXL-Based Microsecond-Latency External Memory

Sano, Bando, Hiwada et al. 2023

Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analys

In GPU graph analytics, the use of external memory such as the host DRAM and solid-state drives is a cost-effective approach to processing large graphs beyond the capacity of the GPU onboard memory. This paper studies the use of Compute Express Link (CXL) memory as alternative external memory for GPU graph processing in order to see if this emerging memory expansion technology enables graph processing that is as fast as using the host DRAM. Through analysis and evaluation using FPGA prototypes, we show that representative GPU graph traversal algorithms involving fine-grained random access can tolerate an external memory latency of up to a few microseconds introduced by the CXL interface as well as by the underlying memory devices. This insight indicates that microsecond-latency flash memory may be used as CXL memory devices to realize even more cost-effective GPU graph processing while still achieving performance close to using the host DRAM.

CCS Concepts: Hardware → Analysis and design of emerging devices and systems; Memory and dense storage.
