2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps49936.2021.00097
Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

Cited by 23 publications (11 citation statements). References 26 publications.
“…3) Test Datasets: We conduct our evaluation and comparison based on eight typical 1D∼4D real-world HPC simulation datasets, including six from Scientific Data Reduction Benchmarks [34]: 1D HACC cosmology simulation [12], 2D LAMMPS (part of the EXAALT ECP project) molecular dynamics simulation [24], 3D CESM-ATM climate simulation [6], 3D Nyx cosmology simulation [31], 4D Hurricane ISABEL simulation [16], and 4D QMCPack quantum simulation [32]. They have been widely used in much prior work [37,26,27,47,46,38,40,39,20,4] and are good representatives of production-level simulation datasets. Additionally, we also evaluate two datasets that highlight our decoders' potential to be used as in-memory compressors as discussed in §I, including 3D RTM simulation data for petroleum exploration [17] and 1D GAMESS data for quantum chemistry simulation [10].…”
Section: Performance Evaluation
confidence: 99%
“…For example, Lal et al proposed a Huffman-based entropy encoding system (E²MC) for GPUs [23]. More recently, Tian et al proposed a fast parallel Huffman codebook construction algorithm and a parallel Huffman encoder for modern GPU architectures [40]. Since much work has already been focused on optimizing Huffman encoding, we do not presently consider optimizing encoding in our work.…”
Section: Use-case of Our Two Decoders
confidence: 99%
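The passage above contrasts GPU Huffman *encoding* work with the decoders studied by the citing authors. For readers unfamiliar with the step Tian et al. parallelize, the following is a minimal serial sketch of classic heap-based Huffman codebook construction — not the cited GPU algorithm, just the baseline computation it accelerates:

```python
import heapq
from collections import Counter

def build_codebook(symbols):
    """Serial Huffman codebook construction (illustrative baseline only;
    the cited work [40] parallelizes this step on GPUs)."""
    freq = Counter(symbols)
    # Each heap entry: (frequency, tie-breaker, {symbol: code-so-far}).
    # Codes grow by one prefix bit each time two subtrees merge.
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate single-symbol alphabet
        return {s: "0" for s in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]
```

The heap loop is inherently sequential, which is why this stage becomes a bottleneck at GPU encoding throughputs and motivates the parallel construction of [40].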
“…Note that our design is different from other Huffman coding works in terms of adaptivity. For example, Tian et al [50] proposed a reduction-based scheme for GPUs that iteratively merges the encoded symbols and adaptively determines the number of merge iterations. However, CEAZ only builds a new codebook for the data chunk when the change of its histogram exceeds a threshold in order to target FPGA with limited resources and low clock frequency.…”
Section: Adaptive Online Codewords Updates
confidence: 99%
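The adaptive policy described above — rebuilding the codebook only when a chunk's histogram changes beyond a threshold — can be sketched as follows. The drift metric (L1 distance between normalized histograms) and the default threshold are assumptions for illustration, not CEAZ's actual choices:

```python
from collections import Counter

def histogram_change(prev_hist, cur_hist):
    """L1 distance between two normalized symbol histograms
    (one plausible drift metric; CEAZ's metric may differ)."""
    total_prev = sum(prev_hist.values()) or 1
    total_cur = sum(cur_hist.values()) or 1
    keys = set(prev_hist) | set(cur_hist)
    return sum(
        abs(prev_hist.get(k, 0) / total_prev - cur_hist.get(k, 0) / total_cur)
        for k in keys
    )

def needs_new_codebook(prev_chunk, cur_chunk, threshold=0.2):
    """Rebuild only when the chunk's symbol distribution drifts past
    `threshold` (hypothetical default value)."""
    return histogram_change(Counter(prev_chunk), Counter(cur_chunk)) > threshold
```

Skipping rebuilds for stable chunks is what lets a resource-limited, lower-clocked FPGA amortize the construction cost across many chunks.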
“…Currently, several GPU-based error-controlled lossy compressors (such as CUSZ [13] and cuZFP [14]) have been developed, but they suffer from either sub-optimal compression throughput or low compression ratios. For instance, CUSZ can achieve much higher compression ratios than cuZFP, but its performance is substantially limited by the Huffman encoding stage and dictionary encoding step [15]. However, the high compression ratios of SZ/CUSZ significantly depend on Huffman encoding and dictionary encoding, because the output of the prediction-and-quantization step in SZ/CUSZ is often composed of a large amount of repeated symbols.…”
Section: Introduction
confidence: 99%
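The claim above — that the prediction-and-quantization output of SZ/CUSZ is dominated by repeated symbols and therefore compresses well under Huffman and dictionary coding — can be illustrated with a Shannon-entropy estimate on a toy quantization-code stream (the stream and its skew are invented for illustration):

```python
import math
from collections import Counter

def shannon_entropy_bits(symbols):
    """Bits/symbol lower bound for entropy coding. A repeat-heavy stream,
    like quantization codes from a good predictor, sits far below the
    fixed-width storage cost."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

# Toy stream: 90% of codes hit the zero bin (accurate prediction).
stream = [0] * 900 + [1] * 60 + [-1] * 40
print(shannon_entropy_bits(stream))
```

With under one bit per symbol achievable against, say, a 16-bit fixed-width representation, the entropy-coding stage is where most of the compression ratio comes from — which is why its throughput dominates overall performance.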