Network Quantization with Element-wise Gradient Scaling

Lee, Junghyup; Kim, Dohyung; Ham, Bumsub

doi:10.1109/cvpr46437.2021.00638

Cited by 78 publications

(77 citation statements)

References 16 publications


“…Model quantization: Besides clustering and regularization methods, model quantization can also reduce the model size, and training-time quantization techniques have been developed to improve the accuracy of quantized models. EWGS [9] adjusts gradients by scaling them up or down based on the Hessian approximation for each layer. PROFIT [12] adopts an iterative process and freezes layers based on the activation instability.…”

Section: Model Compression Using Regularization (mentioning)

confidence: 99%
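The quoted statement says only that EWGS scales gradients up or down. As a rough illustration, the sketch below assumes a scaling rule of the form g_x = g_q * (1 + delta * sign(g_q) * (x - x_q)), where x is the latent value, x_q its rounded counterpart, and delta a non-negative scaling factor. The rule's exact form, the name `ewgs_backward`, and the sample values are assumptions for illustration and are not taken from the excerpt. The Hessian-based choice of delta is not modeled.

```python
def sign(v):
    # Returns -1, 0, or 1 for a real number.
    return (v > 0) - (v < 0)

def ewgs_backward(grad_q, x, x_q, delta):
    """Scale each upstream gradient element by its rounding error (assumed rule)."""
    return [g * (1.0 + delta * sign(g) * (xi - qi))
            for g, xi, qi in zip(grad_q, x, x_q)]

x = [0.2, 0.7, 1.4]                 # latent (full-precision) values
x_q = [round(v) for v in x]         # rounded values used in the forward pass
grad_q = [0.5, -0.5, 0.5]           # gradients w.r.t. the rounded values

grad_x = ewgs_backward(grad_q, x, x_q, delta=0.2)
# With delta = 0 the gradient passes through unchanged, as in a
# straight-through estimator.
grad_ste = ewgs_backward(grad_q, x, x_q, delta=0.0)
```

Under this assumed rule, each element's gradient is amplified or attenuated depending on the sign of the gradient and the direction of the rounding error, rather than being copied unchanged.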

“…• Instead of hard weight-cluster assignment and approximated gradient [6,9,19,26], DKM uses flexible and differentiable attention-based weight clustering and computes gradients w.r.t the task loss without approximation.…”

Section: Differentiable K-means Clustering Layer For Weight-clustering (mentioning)

confidence: 99%
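To make "attention-based weight clustering" concrete, here is a minimal sketch of one soft k-means step on scalar weights. It assumes the attention weights are a softmax over negative weight-to-centroid distances with a temperature; that specific form, the function names, and the sample values are illustrative assumptions, not details given in the statement above.

```python
import math

def soft_assign(weights, centroids, temperature):
    """Return one softmax attention row per weight over all centroids."""
    rows = []
    for w in weights:
        logits = [-abs(w - c) / temperature for c in centroids]
        m = max(logits)                      # subtract max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

def update_centroids(weights, attention):
    """Recompute each centroid as the attention-weighted mean of the weights."""
    k = len(attention[0])
    new = []
    for j in range(k):
        num = sum(a[j] * w for a, w in zip(attention, weights))
        den = sum(a[j] for a in attention)
        new.append(num / den)
    return new

weights = [-1.1, -0.9, 0.8, 1.2]
centroids = [-0.5, 0.5]
attention = soft_assign(weights, centroids, temperature=0.1)
centroids = update_centroids(weights, attention)
```

Because every weight contributes to every centroid with a smooth, non-zero share, there is no hard assignment step, which is what allows gradients to flow through the clustering in this kind of scheme.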

“…• DKM requires no additional learnable parameters [9,24], thus making the training flow simple. For example, DKM-base approach does not need to substitute a convolution layer with a specialized version with additional learning parameters.…”

Section: Differentiable K-means Clustering Layer For Weight-clustering (mentioning)

confidence: 99%

“…• DKM requires no additional computation such as Hessian trace [4] or SVD [9,27] for gradient approximation, because DKM uses a differentiable process.…”

Section: Differentiable K-means Clustering Layer For Weight-clustering (mentioning)

confidence: 99%

“…With the multi-dimensional scheme, the effective bit-per-weight becomes b/d for b-bit/d-dim clustering. Such multi-dimensional clustering can be detrimental to conventional methods (i.e., DNN training not converging) [9,19,26], as now a weight might be on the boundary to multiple centroids, and the chance of making wrong decisions grows exponentially with the number of centroids. For example, there are only two centroids for 1bit/1dim clustering, while there are 16 centroids in 4bit/4dim clustering, although both have the same effective bit-per-weight.…”

Section: Multi-dimensional Dkm (mentioning)

confidence: 99%
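The arithmetic in the statement above can be checked with a short sketch. It reads the effective bit-per-weight as b divided by d, which matches the quoted example, and takes the centroid count as 2 to the power b. The function names are hypothetical.

```python
def effective_bits_per_weight(bits, dims):
    # b bits index one d-dimensional centroid, shared by d weights.
    return bits / dims

def num_centroids(bits):
    # A b-bit index can address 2**b centroids.
    return 2 ** bits

configs = [(1, 1), (4, 4)]
summary = [(b, d, effective_bits_per_weight(b, d), num_centroids(b))
           for b, d in configs]
for b, d, bpw, k in summary:
    print(f"{b}bit/{d}dim: {bpw} bit-per-weight, {k} centroids")
```

Both configurations cost one bit per weight, but the 4bit/4dim case has eight times as many candidate centroids per assignment decision.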


DKM: Differentiable K-Means Clustering Layer for Neural Network Compression

Cho, Alizadeh-Vahid, Adya, et al. 2021

Preprint


Deep neural network (DNN) model compression for efficient on-device inference is becoming increasingly important to reduce memory requirements and keep user data on-device. To this end, we propose a novel differentiable k-means clustering layer (DKM) and its application to train-time weight clustering based DNN model compression. DKM casts k-means clustering as an attention problem and enables joint optimization of the DNN parameters and clustering centroids. Unlike prior works that rely on additional regularizers and parameters, DKM-based compression keeps the original loss function and model architecture fixed. We evaluated DKM-based compression on various DNN models for computer vision and natural language processing (NLP) tasks. Our results demonstrate that DKM delivers superior compression and accuracy trade-off on ImageNet1k and GLUE benchmarks. For example, DKM-based compression can offer 74.5% top-1 ImageNet1k accuracy on ResNet50 DNN model with 3.3MB model size (29.4x model compression factor). For MobileNet-v1, which is a challenging DNN to compress, DKM delivers 62.8% top-1 ImageNet1k accuracy with 0.74 MB model size (22.4x model compression factor). This result is 6.8% higher top-1 accuracy and 33% relatively smaller model size than the current state-of-the-art DNN compression algorithms. Additionally, DKM enables compression of DistilBERT model by 11.8x with minimal (1.1%) accuracy loss on GLUE NLP benchmarks.
