2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) 2020
DOI: 10.1109/isca45697.2020.00075
JPEG-ACT: Accelerating Deep Learning via Transform-based Lossy Compression

Cited by 42 publications (31 citation statements) · References 32 publications
“…Note that convolutional layers also dominate the computation time during training, which allows us to apply compression with low overhead. Similar to many previous studies [13,24,42,58], our research goal is to develop an efficient and generic strategy that achieves a high reduction in memory consumption for CNN training. Our work can increase the batch-size limit and convergence speed, or enable training the same CNN model on hardware with lower memory capacity.…”
Section: Research Goals and Challenges
confidence: 99%
“…Moreover, memory compression approaches based on lossless compression of activation data [49] suffer from limited compression ratios (e.g., only around 2:1 for most floating-point data). Alternatively, recent works [6,13] proposed compression-offloading accelerators that reduce the activation data size before transferring it to CPU DRAM. However, adding a new dedicated hardware component to the existing GPU architecture requires tremendous industry effort and is not ready for immediate deployment.…”
Section: Introduction
confidence: 99%
“…[11][12][13][14][15] focus on designing hardware-friendly compression encoders and decompression decoders, or on reducing the computation cost of sparse data by using a special storage format. [16][17][18][19] design algorithms to compress data transferred between GPU and CPU memory during training.…”
Section: Introduction
confidence: 99%
“…We observe that activations carry concentrated 2-D frequency-domain information, which shows that 2D-DCT can be used to compress feature maps in the earlier layers of a CNN. We then combine Approximate Sparsity Preprocessing (ASP) and low-bit quantization to compress feature maps in the later layers, achieving a compression ratio of 2.6× (+30% over [16]).…”
confidence: 99%
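The 2D-DCT idea in the excerpt above can be sketched as follows. This is a minimal illustration of transform-based lossy compression of a feature map — take the 2-D DCT, keep only a low-frequency block of coefficients, and invert on decompression. The function names and the simple coefficient-truncation scheme are assumptions for illustration, not the cited paper's actual pipeline (which also uses ASP and low-bit quantization):

```python
import numpy as np
from scipy.fft import dctn, idctn

def compress_feature_map(fmap, keep_frac=0.25):
    """Lossy-compress a 2-D feature map by keeping only the
    low-frequency block of its 2-D DCT coefficients."""
    coeffs = dctn(fmap, norm="ortho")
    h, w = coeffs.shape
    kh = max(1, int(h * keep_frac))
    kw = max(1, int(w * keep_frac))
    kept = coeffs[:kh, :kw].copy()   # low-frequency block only
    return kept, fmap.shape

def decompress_feature_map(kept, shape):
    """Zero-pad the kept coefficients back to full size and invert."""
    coeffs = np.zeros(shape)
    kh, kw = kept.shape
    coeffs[:kh, :kw] = kept
    return idctn(coeffs, norm="ortho")
```

Because early-layer activations are smooth (their energy concentrates in low frequencies), truncating high-frequency coefficients this way discards little information while shrinking the stored tensor by `1 / keep_frac**2`.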