ZFP-V: Hardware-Optimized Lossy Floating Point Compression

Sun, Gongjin; Jun, Sang-Woo

doi:10.1109/icfpt47387.2019.00022

Cited by 6 publications

(10 citation statements)

References 17 publications

“…This is still not enough to saturate the PCIe link in compressed form. In a previous work, we have explored optimized implementations of the ZFP algorithm on an Arria 10 FPGA [60], and arrived at similar results: Achieving 1-2 GB/s of bandwidth while consuming 30% of chip space. In the same work, we introduced some algorithmic optimizations to the ZFP algorithm which almost doubled the performance while maintaining similar compression efficiency and on-chip resource utilization.…”

Section: Performance Overhead and Acceleration (mentioning)

confidence: 70%

“…In order to overcome this limitation while maintaining high compression efficiency, we implement a 2-layer header structure. This is a new algorithmic improvement over the previously published 2D ZFP-V algorithm [60]. We observe from Table 2, only the bit planes whose MSB is 0 need a 1-bit header and all others need a 3-bit header.…”

Section: Variable Length Headers in ZFP-V2 (mentioning)

confidence: 80%

“…When using the simpler ZFP-V1 cores, the BurstZ+ platform consumes about 24% of the on-chip resources of our prototype platform, and less than 3% of the on-chip resources of a modern, high-end FPGA such as the Virtex Ultrascale+. We also present the resource utilization of our best effort unmodified ZFP accelerator implementation, the performance of which we will present in Section 6 in relation to ZFP-V. We note that the resource utilization of the single unmodified ZFP accelerator pipeline is comparable to the published resource utilization numbers of an unmodified SZ accelerator pipeline [68], as well as the best-effort OpenCL implementation of ZFP on an Arria 10 FPGA [60]. Besides LUTs, the BurstZ+ platform consumes less than 500 KB of on-chip Block RAM resources, leaving the majority of on-chip memory resources to the computation engine.…”

Section: Implementation Details (mentioning)

confidence: 92%

“…BurstZ+ improves the previously published BurstZ platform by expanding the compression library with a more efficient 2D ZFP-V variant with its own set of optimizations. Furthermore, the version of ZFP-V2 presented here makes several improvements over the previously published version of the 2D ZFP-V algorithm [60] including an RTL implementation of a two-level header, which enables more efficient resource utilization as well as both higher average and worst-case performance per pipeline. We demonstrate that these improvements enable enhanced performance and scalability.…”

Section: Contributions (mentioning)

confidence: 99%

BurstZ+: Eliminating The Communication Bottleneck of Scientific Computing Accelerators via Accelerated Compression

Sun; Kang; Jun

2022

ACM Trans. Reconfigurable Technol. Syst.

Self Cite

We present BurstZ+, an accelerator platform that eliminates the communication bottleneck between PCIe-attached scientific computing accelerators and their host servers, via hardware-optimized compression. While accelerators such as GPUs and FPGAs provide enormous computing capabilities, their effectiveness quickly deteriorates once data is larger than its on-board memory capacity, and performance becomes limited by the communication bandwidth of moving data between the host memory and accelerator. Compression has not been very useful in solving this issue due to performance and efficiency issues of compressing floating point numbers, which scientific data often consists of. BurstZ+ is an FPGA-based prototype accelerator platform which addresses the bandwidth issue via a class of novel hardware-optimized floating point compression algorithm called ZFP-V. We demonstrate that BurstZ+ can completely remove the host-side communication bottleneck for accelerators, using multiple stencil kernels with a wide range of operational intensities. Evaluated against hand-optimized implementations of kernel accelerators of the same architecture, our single-pipeline BurstZ+ prototype outperforms an accelerator without compression by almost 4×, and even an accelerator with enough memory for the entire dataset by over 2×. Furthermore, the projected performance of BurstZ+ on a future, faster FPGA scales to almost 7× that of the same accelerator without compression, whose performance is still limited by the PCIe bandwidth.
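The bandwidth argument in this abstract can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the paper. It assumes that the accelerator's uncompressed-equivalent input rate is the smaller of (link bandwidth × compression ratio) and the decompressor's output throughput, and that decompressor throughput scales linearly with the number of pipelines. The function names and every number in the usage section are hypothetical.

```python
import math


def effective_bandwidth(link_gbs, ratio, codec_gbs):
    """Uncompressed-equivalent data rate delivered to the compute kernel (GB/s).

    link_gbs  -- raw host link bandwidth in GB/s
    ratio     -- compression ratio (uncompressed size / compressed size)
    codec_gbs -- decompressor output throughput in GB/s (uncompressed side)
    """
    if link_gbs <= 0 or ratio <= 0 or codec_gbs <= 0:
        raise ValueError("all arguments must be positive")
    # The link can carry link_gbs * ratio worth of uncompressed data,
    # but the decompressor caps what actually reaches the kernel.
    return min(link_gbs * ratio, codec_gbs)


def pipelines_to_saturate(link_gbs, ratio, per_pipeline_gbs):
    """Pipelines needed so the link, not the decompressor, is the limit.

    Assumes throughput adds up linearly across pipelines (an assumption).
    """
    if link_gbs <= 0 or ratio <= 0 or per_pipeline_gbs <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(link_gbs * ratio / per_pipeline_gbs)


if __name__ == "__main__":
    # Hypothetical numbers: 8 GB/s link, 4:1 ratio, 2 GB/s per pipeline.
    print(effective_bandwidth(8.0, 4.0, 2.0))     # decompressor-bound case
    print(effective_bandwidth(8.0, 4.0, 64.0))    # link-bound case
    print(pipelines_to_saturate(8.0, 4.0, 2.0))
```

Under this model a slow decompressor can leave the system worse off than sending uncompressed data, which is consistent with the citing text's concern about saturating the PCIe link in compressed form.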

“…However, the compression ratio is still limited between 2:1 and 4:1 despite the loss of precision as these approaches do not exploit inter-value similarities to compress data. Closer to MemSZ, software techniques for lossy compression have been proposed, but have high complexity and latency and as a consequence cannot be used directly for memory compression [11,61].…”

Section: Related Work (mentioning)

confidence: 99%

MemSZ

Eldstål-Ahrens; Sourdis

2020

ACM Trans. Archit. Code Optim.

This article describes Memory Squeeze (MemSZ), a new approach for lossy general-purpose memory compression. MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in our implementation. Our compressor is placed between the memory controller and the cache hierarchy of a processor to reduce the memory traffic of applications that tolerate approximations in parts of their data. Thereby, the available off-chip bandwidth is utilized more efficiently improving system performance and energy efficiency. Two alternative multi-core variants of the MemSZ system are described. The first variant has a shared last-level cache (LLC) on the processor-die, which is modified to store both compressed and uncompressed data. The second has a 3D-stacked DRAM cache with larger cache lines that match the granularity of the compressed memory blocks and stores only uncompressed data. For applications that tolerate aggressive approximation in large fractions of their data, MemSZ reduces baseline memory traffic by up to 81%, execution time by up to 62%, and energy costs by up to 25% introducing up to 1.8% error to the application output. Compared to the current state-of-the-art lossy memory compression design, MemSZ improves the execution time, energy, and memory traffic by up to 15%, 9%, and 64%, respectively.

ZHW: A Numerical CODEC for Big Data Scientific Computation

Barrow; Lloyd; et al.

2022

2022 International Conference on Field-Programmable Technology (ICFPT)
