2021
DOI: 10.48550/arxiv.2108.12659
Preprint

DKM: Differentiable K-Means Clustering Layer for Neural Network Compression

Abstract: Deep neural network (DNN) model compression for efficient on-device inference is becoming increasingly important to reduce memory requirements and keep user data on-device. To this end, we propose a novel differentiable k-means clustering layer (DKM) and its application to train-time weight clustering based DNN model compression. DKM casts k-means clustering as an attention problem and enables joint optimization of the DNN parameters and clustering centroids. Unlike prior works that rely on additional regulari…

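The core idea stated in the abstract, casting the k-means assignment as an attention problem so that gradients reach both the layer weights and the clustering centroids, can be sketched roughly as below. This is an illustrative sketch only, not the paper's implementation; the temperature `tau`, the 16-entry codebook size, and the helper name `dkm_soft_cluster` are assumptions.

```python
import torch

def dkm_soft_cluster(weights, centroids, tau=1e-3):
    """Differentiable (soft) k-means assignment of weights to centroids.

    `tau` is an assumed softmax temperature; a small value makes the soft
    assignment approach hard k-means while keeping the operation differentiable.
    """
    w = weights.reshape(-1, 1)                    # (N, 1) flattened weights
    c = centroids.reshape(1, -1)                  # (1, K) centroids
    dist = (w - c) ** 2                           # (N, K) squared distances
    # Attention over centroids: softmax of negative distance
    attn = torch.softmax(-dist / tau, dim=1)      # (N, K) soft assignments
    # Soft-clustered weights used in the forward pass; gradients flow to
    # both the original weights and the centroids.
    w_clustered = attn @ centroids                # (N,)
    return w_clustered.reshape(weights.shape)

# Usage sketch: replace a layer's weight with its soft-clustered version
# during training and optimize weights and centroids jointly.
W = torch.randn(64, 64, requires_grad=True)
C = torch.randn(16, requires_grad=True)           # 16 clusters ~ a 4-bit palette
W_q = dkm_soft_cluster(W, C)
loss = W_q.sum()                                  # stand-in for the task loss
loss.backward()                                   # gradients reach both W and C
```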
Cited by 4 publications (4 citation statements) | References 19 publications

Citation statements:
“…However, it is currently (2023) very difficult to attain Hopper GPUs. Moreover, while int8 and int4 are used for inference [16,64,15], and there is earlier work exploring 8 bit training for convnets [59,77,10], these formats are not commonly used for training transformer models at scale. The CLIP ViT-Huge models we train have 1B parameters including the image and text towers which is 40x larger than a standard ResNet-50 (23M) [27], and quantization is more challenging for large tensors [16].…”
Section: Preliminaries and Related Work
Mentioning confidence: 99%
“…The proposed approximate mixed-signal computing unit offers the advantage of resiliency to PVT changes and performs distance computation with high throughput and low energy, thereby resulting in an energy-efficient design that can potentially be utilized for low-power signal processing tasks. Finally, we envision that the limited scope of application based on the proposed approximate K-means clustering system could be resolved through future works targeting interdisciplinary studies on neural networks and K-means clustering, as shown in recent research [30,31].…”
Section: Discussion
Mentioning confidence: 99%
“…The lookup table is then adjusted using the cumulative gradient values for the weights that correspond to the same centroid. The original weights are also updated by using a straight-through estimator to overwrite the non-differentiable structure of clustering with an identity function, which allows all upstream gradients to be used in the updated original non-clustered weights of the layer [73], [74].…”
Section: B. Weights Clustering
Mentioning confidence: 99%
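The excerpt above describes hard weight clustering through a lookup table, with a straight-through estimator so the clustering step is treated as an identity in the backward pass. A minimal sketch of that scheme, assuming PyTorch; the helper name, the codebook size, and the `index_add_`-based codebook update mentioned in the comment are illustrative assumptions, not details taken from the cited works.

```python
import torch

def cluster_weights_ste(weights, codebook):
    """Hard-assign each weight to its nearest codebook entry (lookup table),
    but pass gradients straight through to the original weights (STE)."""
    w = weights.reshape(-1, 1)                                      # (N, 1)
    idx = torch.argmin((w - codebook.reshape(1, -1)) ** 2, dim=1)   # nearest centroid per weight
    w_hard = codebook[idx].reshape(weights.shape)                   # clustered value used in forward
    # Straight-through estimator: forward returns w_hard, backward treats the
    # clustering as identity so upstream gradients reach `weights` unchanged.
    return weights + (w_hard - weights).detach()

# The codebook itself could then be adjusted from the gradients accumulated per
# centroid (e.g. scatter the flat weight gradients with index_add_ over idx).
W = torch.randn(128, requires_grad=True)
codebook = torch.linspace(-1, 1, 16)               # 16-entry lookup table
W_q = cluster_weights_ste(W, codebook)
W_q.sum().backward()                               # grad w.r.t. W is all ones (identity)
```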