Lossless data compression is a promising software approach for reducing the bandwidth requirements of scientific applications on accelerator clusters without introducing approximation errors. Suitable compressors must be able to effectively compact floating-point data while saturating the system interconnect to avoid introducing unnecessary latencies. We present ndzip-gpu, a novel, highly efficient GPU parallelization scheme for the block compressor ndzip, which has recently set a new milestone in CPU floating-point compression speeds. Through the combination of intra-block parallelism and efficient memory access patterns, ndzip-gpu achieves high resource utilization in decorrelating multi-dimensional data via the Integer Lorenzo Transform. We further introduce a novel, efficient warp-cooperative primitive for vertical bit packing, providing a high-throughput data reduction and expansion step. Using a representative set of scientific data, we compare the performance of ndzip-gpu against five existing GPU compressors. While we observe that the effectiveness of any compressor strongly depends on the characteristics of the dataset, we demonstrate that ndzip-gpu offers the best average compression ratio for the examined data. On Nvidia Turing, Volta, and Ampere hardware, it achieves the highest single-precision throughput by a significant margin while maintaining a favorable trade-off between data reduction and throughput in the double-precision case.
CCS CONCEPTS
• Computing methodologies → Parallel algorithms; • Theory of computation → Data compression.
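The abstract mentions a warp-cooperative primitive for vertical bit packing. As a rough illustration of the general idea only (transposing a block of residuals into bit planes and eliminating all-zero planes), the following is a minimal, single-threaded C++ sketch; the function name pack_block, the 32-word block size, and the header layout are illustrative assumptions, not the paper's actual GPU implementation.

```cpp
// Illustrative sketch (not the paper's implementation): "vertical" bit packing
// of a block of 32 32-bit residual words. The block is viewed as a 32x32 bit
// matrix and transposed, so that bit position i of every input word is
// gathered into output word i. Output words that are entirely zero (bit planes
// no residual sets) are omitted; a 32-bit header records which planes remain.
#include <cstdint>
#include <vector>

std::vector<uint32_t> pack_block(const uint32_t in[32]) {
    uint32_t planes[32] = {};                      // one output word per bit plane
    for (int bit = 0; bit < 32; ++bit) {
        for (int word = 0; word < 32; ++word) {
            planes[bit] |= ((in[word] >> bit) & 1u) << word;
        }
    }
    uint32_t head = 0;                             // bitmap of non-zero planes
    std::vector<uint32_t> out;
    for (int bit = 0; bit < 32; ++bit) {
        if (planes[bit] != 0) {
            head |= 1u << bit;
            out.push_back(planes[bit]);
        }
    }
    out.insert(out.begin(), head);                 // header first, then kept planes
    return out;
}
```

Expansion would invert these steps: read the header bitmap, re-insert the omitted zero planes, and transpose back. The abstract's contribution lies in carrying out such a reduction and expansion step cooperatively within a GPU warp at high throughput.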