BurstZ+: Eliminating The Communication Bottleneck of Scientific Computing Accelerators via Accelerated Compression

Sun, Gongjin; Kang, Seongyoung; Jun, Sang-Woo

doi:10.1145/3476831

Cited by 7 publications

(3 citation statements)

References 63 publications


“…Sun et al [21], to appear in 2022, tackle the same bandwidth problem as ours by using a mix of compression and data layout. Although this approach does not rely on polyhedral dependency analysis, it features the same base idea: group together data that is being used together.…”

Section: Column Major Row Major Data Tiling + Row Major (mentioning)

confidence: 99%

Increasing FPGA Accelerators Memory Bandwidth with a Burst-Friendly Memory Layout

Ferry, Yuki, Derrien et al. 2022

Preprint


Offloading compute-intensive kernels to hardware accelerators relies on the large degree of parallelism offered by these platforms. However, the effective bandwidth of the memory interface often causes a bottleneck, hindering the accelerator's effective performance. Techniques enabling data reuse, such as tiling, lower the pressure on memory traffic but still often leave the accelerators I/O-bound. A further increase in effective bandwidth is possible by using burst rather than element-wise accesses, provided the data is contiguous in memory. In this paper, we propose a memory allocation technique, and provide a proof-of-concept source-to-source compiler pass, that enables such burst transfers by modifying the data layout in external memory. We assess how this technique pushes up the memory throughput, leaving room for exploiting additional parallelism, for a minimal logic overhead.


“…Nonetheless, tiling in the on-chip memory of accelerators is not mentioned. Sun et al [28], [29] presented compression-based approaches to address the performance bottleneck imposed by data transfer during executing stencil tasks in accelerator-based (e.g., CPU-GPU and CPU-FPGA) systems. Their work can be leveraged in combination with ours to further enhance the performance.…”

Section: Related Work (mentioning)

confidence: 99%

A compression-based memory-efficient optimization for out-of-core GPU stencil computation

et al. 2023


Stencil computation is an extensively utilized class of scientific-computing applications that can be efficiently accelerated by graphics processing units (GPUs). Out-of-core approaches enable a GPU to handle large stencil codes whose data size is beyond the memory capacity of the GPU. However, current research on out-of-core stencil computation primarily focuses on minimizing the amount of data transferred between the CPU and GPU. Few studies consider simultaneously optimizing data transfer and kernel execution. To fill the research gap, this work presents a synergy between on- and off-chip data reuse for out-of-core stencil codes, termed SO2DR. First, overlapping regions between data chunks are shared in the off-chip memory to eliminate redundant CPU-GPU data transfer. Secondly, redundant computation at the off-chip memory level is intentionally introduced to decouple kernel execution from region sharing, hence enabling data reuse in the on-chip memory. Experimental results demonstrate that SO2DR significantly enhances the kernel-execution performance while reducing the CPU-GPU data-transfer time. Specifically, SO2DR achieves average speedups of 2.78× and 1.14× for five stencil benchmarks, compared to an out-of-core stencil code which is free of redundant transfer and computation, and an in-core stencil code which is free of data transfer, respectively.


“…Sun et al [19] proposed an accelerator platform that eliminates the data movement bottleneck between PCIe-attached FPGAs and their host servers via compression. Their approach mainly focuses on optimizing the ZFP compression algorithm [20] on a hardware (i.e.…”

Section: Related Work (mentioning)

confidence: 99%

Compression-Based Optimizations for Out-of-Core GPU Stencil Computation

Shen, Deng, Wu et al. 2022

Preprint


An out-of-core stencil computation code handles large data whose size is beyond the capacity of GPU memory. However, such a code requires streaming data to and from the GPU frequently. As a result, data movement between the CPU and GPU usually limits the performance. In this work, compression-based optimizations are proposed. First, an on-the-fly compression technique is applied to an out-of-core stencil code, reducing the CPU-GPU memory copy. Secondly, a single working buffer technique is used to reduce GPU memory consumption. Experimental results show that the stencil code using the proposed techniques achieved a 1.1× speedup and reduced GPU memory consumption by 33.0% on an NVIDIA Tesla V100 GPU.
