2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps49936.2021.00095
Accelerating Multigrid-based Hierarchical Scientific Data Refactoring on GPUs

Abstract: Rapid growth in scientific data and a widening gap between computational speed and I/O bandwidth make it increasingly infeasible to store and share all data produced by scientific simulations. Instead, we need methods for reducing data volumes: ideally, methods that can scale data volumes adaptively so as to enable negotiation of performance and fidelity tradeoffs in different situations. Multigrid-based hierarchical data representations hold promise as a solution to this problem, allowing for flexible conver…

Cited by 13 publications (8 citation statements); references 28 publications.
“…Compared with cuSZ and cuMGARD, cuZFP provides slightly higher compression throughput, but it only supports fixed-rate mode [19], limiting its adoption in practice. Both cuSZ and cuMGARD use Huffman encoding to achieve high compression ratios, and their decompression throughput is greatly limited by slow Huffman decoding on GPUs, but cuSZ has a much higher throughput than cuMGARD [38,5]. Thus, in this work, we focus on optimizing Huffman decoding for cuSZ.…”
Section: Error-bounded Lossy Compression on GPU (mentioning; confidence: 99%)
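To illustrate why Huffman decoding is the bottleneck the citing work describes, here is a minimal pure-Python sketch (an illustration only, not the cuSZ or cuMGARD implementation): encoding each symbol is embarrassingly parallel, but decoding a bitstream is inherently sequential because a codeword's boundary is only known after all preceding bits have been consumed.

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a prefix-free Huffman code: {symbol: bitstring}."""
    freq = Counter(data)
    if len(freq) == 1:                      # degenerate one-symbol input
        return {next(iter(freq)): "0"}
    heap = [[w, [sym, ""]] for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:                 # left subtree gets a leading 0
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:                 # right subtree gets a leading 1
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return dict(pair for pair in heap[0][1:])

def huffman_decode(bits, codes):
    # Decoding is sequential: each codeword's start depends on where the
    # previous one ended, which is why naive Huffman decoding maps poorly
    # onto the thousands of independent threads a GPU provides.
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:              # prefix-free: first match wins
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

GPU decoders work around this serial dependency with techniques such as gap arrays or self-synchronizing chunked decoding, which is the optimization space the citing paper targets.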
“…SZ, ZFP, and MGARD were first developed for CPU architectures, and all started rolling out their GPU-based lossy compression recently. The SZ team, the ZFP team, and the MGARD team released their CUDA versions, called cuSZ [39], cuZFP [7], and cuMGARD [5], respectively. All the versions provide much higher throughputs for compression and decompression compared with their CPU versions [39,19,38].…”
Section: Error-bounded Lossy Compression on GPU (mentioning; confidence: 99%)
“…As the type of processor that contributes the most of the computing parallelism in many current and future HPC systems, Graphics Processing Units (GPUs), equipped with thousands of low-power cores, offer high computational power and energy efficiency. Many applications and libraries have been designed and optimized for GPU accelerators [1,3,8,9,13,25,34,36,42,43]. Benefiting from the fact that GPUs are designed for highly parallelizable computations while CPUs are more efficient with serial computations, CPUs and GPUs that are linked through fast interconnections [30,31] are usually used together to form heterogeneous systems that can efficiently handle a large spectrum of scientific computing workloads.…”
Section: Introduction (mentioning; confidence: 99%)
“…Limitations of state-of-the-art approaches. Existing error-bounded lossy compressors for GPUs (such as cuSZ [14], cuZFP [15], and MGARD-GPU [16]) suffer from either low throughputs or low compression ratios. Specifically, although cuZFP has slightly higher throughput compared with cuSZ and MGARD-GPU, it supports only the fixed-rate mode [17], which suffers much lower compression quality than the fixed-accuracy mode (a.k.a. error-bounded mode) [18], significantly limiting its adoption in practice.…”
Section: Introduction (mentioning; confidence: 99%)
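The fixed-accuracy (error-bounded) mode contrasted above guarantees that every reconstructed value stays within a user-specified pointwise error bound, whereas fixed-rate mode fixes the output size and lets the error float. A minimal sketch of that guarantee using a uniform scalar quantizer (an illustration of the error-bounded contract, not the actual SZ or MGARD algorithm):

```python
def quantize(values, error_bound):
    # Map each value to the nearest multiple of 2*error_bound; the
    # worst-case reconstruction error is then at most error_bound.
    step = 2.0 * error_bound
    return [round(v / step) for v in values]

def dequantize(codes, error_bound):
    step = 2.0 * error_bound
    return [q * step for q in codes]

data = [0.1234, -5.678, 3.14159, 0.0005]
eb = 1e-3
recon = dequantize(quantize(data, eb), eb)
# Every reconstructed value stays within the requested bound.
assert all(abs(a - b) <= eb for a, b in zip(data, recon))
```

Real compressors add prediction and entropy coding on top of the quantization step, but the pointwise bound they advertise is exactly this property.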