Knowledge Distillation Beyond Model Compression

Sarfraz, Fahad; Zonooz, Bahram

doi:10.1109/icpr48806.2021.9413016

Cited by 25 publications

(9 citation statements)

References 13 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Long-term memory builds structural representations for generalization and mimics the slow acquisition of structured knowledge in the neocortex, which can generalize well across tasks. The long-term memory then interacts with the instance-level episodic memory to retrieve structural relational knowledge (Sarfraz et al 2021) for the previous tasks encoded in the output logits. Consolidated logits are then utilized to enforce consistency in the functional space of the working model.…”

Section: Multiple Memory Systemsmentioning

confidence: 99%

Sparse Coding in a Dual Memory System for Lifelong Learning

Sarfraz¹,

Zonooz²

2023

Preprint

View full text Add to dashboard Cite

Efficient continual learning in humans is enabled by a rich set of neurophysiological mechanisms and interactions between multiple memory systems. The brain efficiently encodes information in non-overlapping sparse codes, which facilitates the learning of new associations faster with controlled interference with previous associations. To mimic sparse coding in DNNs, we enforce activation sparsity along with a dropout mechanism which encourages the model to activate similar units for semantically similar inputs and have less overlap with activation patterns of semantically dissimilar inputs. This provides us with an efficient mechanism for balancing the reusability and interference of features, depending on the similarity of classes across tasks. Furthermore, we employ sparse coding in a multiple-memory replay mechanism. Our method maintains an additional long-term semantic memory that aggregates and consolidates information encoded in the synaptic weights of the working model. Our extensive evaluation and characteristics analysis show that equipped with these biologically inspired mechanisms, the model can further mitigate forgetting 1 .

show abstract

Section: Multiple Memory Systemsmentioning

confidence: 99%

Sparse Coding in a Dual Memory System for Lifelong Learning

Sarfraz¹,

Zonooz²

2023

Preprint

View full text Add to dashboard Cite

show abstract

“…Roughly speaking, typical model compression methods can be classified into four categories. Weight pruning, quantization, knowledge distillation (Sarfraz et al, 2021;Walawalkar et al, 2020), and low-rank decomposition (Idelbayev & Carreira-Perpinán, 2020;Lin et al, 2018;Lee et al, 2019). Even though such methods strive to find a smaller model while retaining the model's accuracy, they often tend to neglect the potential inherent in the entropy limit.…”

Section: Related Workmentioning

confidence: 99%

Rotation Invariant Quantization for Model Compression

Kampeas¹,

Nahshan²,

Kremer³

et al. 2023

Preprint

View full text Add to dashboard Cite

Post-training Neural Network (NN) model compression is an attractive approach for deploying large, memory-consuming models on devices with limited memory resources. In this study, we investigate the rate-distortion tradeoff for NN model compression. First, we suggest a Rotation-Invariant Quantization (RIQ) technique that utilizes a single parameter to quantize the entire NN model, yielding a different rate at each layer, i.e., mixed-precision quantization. Then, we prove that our rotation-invariant approach is optimal in terms of compression. We rigorously evaluate RIQ and demonstrate its capabilities on various models and tasks. For example, RIQ facilitates ×19.4 and ×52.9 compression ratios on pretrained VGG dense and pruned models, respectively, with < 0.4% accuracy degradation. Code: https://github.com/ehaleva/RIQ.

show abstract

“…In a typical "teacher-student" knowledge distillation, smaller stu- In the original formulation, Hinton et al [20] proposed a representation distillation by way of mimicking softened softmax output of the teacher. Better generalization can be achieved by emulating the latent feature space in addition to mimicking the output of the teacher [37,39,45,36,33].…”

Section: Related Workmentioning

confidence: 99%

“…Online knowledge distillation offers a more attractive alternative owing to its one stage training and bidirectional knowledge distillation [47,15,26,37]. These approaches treat all (typically two) participating models equally, enabling them to learn from each other.…”

Section: Introductionmentioning

confidence: 99%

Distill on the Go: Online knowledge distillation in self-supervised learning

Bhat¹,

Zonooz²

2021

Preprint

Self Cite

View full text Add to dashboard Cite

Self-supervised learning solves pretext prediction tasks that do not require annotations to learn feature representations. For vision tasks, pretext tasks such as predicting rotation, solving jigsaw are solely created from the input data. Yet, predicting this known information helps in learning representations useful for downstream tasks. However, recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models. To address the issue of self-supervised pre-training of smaller models, we propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation to improve the representation quality of the smaller models. We employ deep mutual learning strategy in which two models collaboratively learn from each other to improve one another. Specifically, each model is trained using self-supervised learning along with distillation that aligns each model's softmax probabilities of similarity scores with that of the peer model. We conduct extensive experiments on multiple benchmark datasets, learning objectives, and architectures to demonstrate the potential of our proposed method. Our results show significant performance gain in the presence of noisy and limited labels, and in generalization to out-of-distribution data.

show abstract

Knowledge Distillation Beyond Model Compression

Cited by 25 publications

References 13 publications

Sparse Coding in a Dual Memory System for Lifelong Learning

Sparse Coding in a Dual Memory System for Lifelong Learning

Rotation Invariant Quantization for Model Compression

Distill on the Go: Online knowledge distillation in self-supervised learning

Contact Info

Product

Resources

About