2021
DOI: 10.48550/arxiv.2108.12659
Preprint

DKM: Differentiable K-Means Clustering Layer for Neural Network Compression

Abstract: Deep neural network (DNN) model compression for efficient on-device inference is becoming increasingly important to reduce memory requirements and keep user data on-device. To this end, we propose a novel differentiable k-means clustering layer (DKM) and its application to train-time weight clustering based DNN model compression. DKM casts k-means clustering as an attention problem and enables joint optimization of the DNN parameters and clustering centroids. Unlike prior works that rely on additional regulari…

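The core idea stated in the abstract, casting the k-means assignment as an attention problem so that gradients reach both the layer weights and the clustering centroids, can be sketched roughly as below. This is an illustrative sketch only, not the paper's implementation; the temperature `tau`, the 16-entry codebook size, and the helper name `dkm_soft_cluster` are assumptions.

```python
import torch

def dkm_soft_cluster(weights, centroids, tau=1e-3):
    """Differentiable (soft) k-means assignment of weights to centroids.

    `tau` is an assumed softmax temperature; a small value makes the soft
    assignment approach hard k-means while keeping the operation differentiable.
    """
    w = weights.reshape(-1, 1)                    # (N, 1) flattened weights
    c = centroids.reshape(1, -1)                  # (1, K) centroids
    dist = (w - c) ** 2                           # (N, K) squared distances
    # Attention over centroids: softmax of negative distance
    attn = torch.softmax(-dist / tau, dim=1)      # (N, K) soft assignments
    # Soft-clustered weights used in the forward pass; gradients flow to
    # both the original weights and the centroids.
    w_clustered = attn @ centroids                # (N,)
    return w_clustered.reshape(weights.shape)

# Usage sketch: replace a layer's weight with its soft-clustered version
# during training and optimize weights and centroids jointly.
W = torch.randn(64, 64, requires_grad=True)
C = torch.randn(16, requires_grad=True)           # 16 clusters ~ a 4-bit palette
W_q = dkm_soft_cluster(W, C)
loss = W_q.sum()                                  # stand-in for the task loss
loss.backward()                                   # gradients reach both W and C
```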
Cited by 4 publications (4 citation statements) | References 19 publications

Citation statements:
“…However, it is currently (2023) very difficult to attain Hopper GPUs. Moreover, while int8 and int4 are used for inference [16,64,15], and there is earlier work exploring 8 bit training for convnets [59,77,10], these formats are not commonly used for training transformer models at scale. The CLIP ViT-Huge models we train have 1B parameters including the image and text towers which is 40x larger than a standard ResNet-50 (23M) [27], and quantization is more challenging for large tensors [16].…”
Section: Preliminaries and Related Work
Mentioning confidence: 99%
“…The proposed approximate mixed-signal computing unit offers the advantage of resiliency to PVT changes and performs distance computation with high throughput and low energy, thereby resulting in an energy-efficient design that can potentially be utilized for low-power signal processing tasks. Finally, we envision that the limited scope of application based on the proposed approximate K-means clustering system could be resolved through future works targeting interdisciplinary studies on neural networks and K-means clustering, as shown in recent research [30,31].…”
Section: Discussion
Mentioning confidence: 99%
“…The lookup table is then adjusted using the cumulative gradient values for the weights that correspond to the same centroid. The original weights are also updated by using a straight-through estimator to overwrite the non-differentiable structure of clustering with an identity function, which allows all upstream gradients to be used in the updated original non-clustered weights of the layer [73], [74].…”
Section: B. Weights Clustering
Mentioning confidence: 99%
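The excerpt above describes hard weight clustering through a lookup table, with a straight-through estimator so the clustering step is treated as an identity in the backward pass. A minimal sketch of that scheme, assuming PyTorch; the helper name, the codebook size, and the `index_add_`-based codebook update mentioned in the comment are illustrative assumptions, not details taken from the cited works.

```python
import torch

def cluster_weights_ste(weights, codebook):
    """Hard-assign each weight to its nearest codebook entry (lookup table),
    but pass gradients straight through to the original weights (STE)."""
    w = weights.reshape(-1, 1)                                      # (N, 1)
    idx = torch.argmin((w - codebook.reshape(1, -1)) ** 2, dim=1)   # nearest centroid per weight
    w_hard = codebook[idx].reshape(weights.shape)                   # clustered value used in forward
    # Straight-through estimator: forward returns w_hard, backward treats the
    # clustering as identity so upstream gradients reach `weights` unchanged.
    return weights + (w_hard - weights).detach()

# The codebook itself could then be adjusted from the gradients accumulated per
# centroid (e.g. scatter the flat weight gradients with index_add_ over idx).
W = torch.randn(128, requires_grad=True)
codebook = torch.linspace(-1, 1, 16)               # 16-entry lookup table
W_q = cluster_weights_ste(W, codebook)
W_q.sum().backward()                               # grad w.r.t. W is all ones (identity)
```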