2020
DOI: 10.1007/978-3-030-58529-7_2

Online Ensemble Model Compression Using Knowledge Distillation

Abstract: This paper presents a novel knowledge distillation based model compression framework consisting of a student ensemble. It enables distillation of simultaneously learnt ensemble knowledge onto each of the compressed student models. Each model learns unique representations from the data distribution due to its distinct architecture. This helps the ensemble generalize better by combining every model's knowledge. The distilled students and ensemble teacher are trained simultaneously without requiring any pretrained…
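A minimal sketch of the training step the abstract describes may help: several compressed students with distinct architectures are trained simultaneously, and each one distills from an ensemble of all the students' predictions, so no pretrained teacher is needed. This assumes PyTorch, a simple logit-averaging ensemble, and a standard cross-entropy plus temperature-scaled KL loss; the paper's exact ensembling rule and loss weighting may differ.

```python
# Hypothetical sketch of the idea in the abstract: several compressed students
# with distinct architectures are trained at the same time, and each student
# distills from the ensemble of all students' predictions, so no pretrained
# teacher is required. The logit-averaging ensemble, temperature T and weight
# alpha are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F

def online_ensemble_distillation_step(students, optimizer, x, y, T=3.0, alpha=0.5):
    logits = [s(x) for s in students]               # one forward pass per student
    ensemble = torch.stack(logits).mean(dim=0)      # ensemble "teacher": mean of student logits

    loss = 0.0
    for z in logits:
        ce = F.cross_entropy(z, y)                  # supervised loss on the labels
        kd = F.kl_div(F.log_softmax(z / T, dim=1),  # distill ensemble knowledge onto this student
                      F.softmax(ensemble.detach() / T, dim=1),
                      reduction="batchmean") * T * T
        loss = loss + (1 - alpha) * ce + alpha * kd

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```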

Cited by 41 publications (40 citation statements)
References 37 publications
“…To demonstrate the use of the distillation metric, we use the results reported in Walawalkar, Shen & Savvides (2020) on the CIFAR100 dataset (Krizhevsky, 2009) with the ResNet44 architecture (He et al., 2016). In their experiment, they trained four student models with relative sizes of 62.84%, 35.36%, 15.25% and 3.74% compared to the teacher model.…”
Section: Distillation Metric
Citation type: mentioning
confidence: 99%
“…Walawalkar, Shen & Savvides (2020) proposed to train an ensemble of models, each broken down into four blocks, where all models share the first block of layers. The first model in the ensemble is considered the teacher (termed the pseudo teacher in the paper).…”
Section: Survey
Citation type: mentioning
confidence: 99%
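As a rough illustration of the architecture this statement describes, one shared first block can feed a pseudo-teacher branch and several student branches. The block contents and sizes below are placeholders, not the blocks used in the cited paper.

```python
# Rough sketch of the architecture described above: each ensemble member is
# split into four blocks, block 1 is shared by every member, and member 0 acts
# as the "pseudo teacher". The layer choices here are placeholders, not the
# blocks used in the cited paper.
import torch.nn as nn

class SharedBlockEnsemble(nn.Module):
    def __init__(self, num_students=3, num_classes=100):
        super().__init__()
        self.shared = nn.Sequential(                 # block 1, shared by all members
            nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

        def branch():                                # blocks 2-4, private to each member
            return nn.Sequential(
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

        self.pseudo_teacher = branch()
        self.students = nn.ModuleList([branch() for _ in range(num_students)])

    def forward(self, x):
        h = self.shared(x)                           # shared features feed every branch
        return self.pseudo_teacher(h), [s(h) for s in self.students]
```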
“…Thus, finding a proper teacher-student architecture in offline distillation is challenging. In contrast, online distillation provides a one-phase end-to-end training scheme via teacher-student collaborative learning on a peer-network architecture instead of a fixed one [25,33,36,32,28,12].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
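To make the offline/online contrast concrete, here is a generic online (mutual) distillation step in the style of deep mutual learning: two peer networks are trained in a single end-to-end phase, each distilling from the other's current predictions. This is an illustrative sketch, not the specific scheme of any one work cited above.

```python
# Generic online (mutual) distillation step in the style of deep mutual
# learning: two peer networks are trained in a single end-to-end phase, each
# distilling from the other's current predictions. Illustrative only; it is
# not the specific scheme of any one work cited above.
import torch.nn.functional as F

def mutual_distillation_step(peer_a, peer_b, opt_a, opt_b, x, y):
    za, zb = peer_a(x), peer_b(x)

    loss_a = F.cross_entropy(za, y) + F.kl_div(      # peer A matches peer B's predictions
        F.log_softmax(za, dim=1), F.softmax(zb.detach(), dim=1),
        reduction="batchmean")
    loss_b = F.cross_entropy(zb, y) + F.kl_div(      # peer B matches peer A's predictions
        F.log_softmax(zb, dim=1), F.softmax(za.detach(), dim=1),
        reduction="batchmean")

    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
    return loss_a.item(), loss_b.item()
```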
“…Inspired by deep mutual learning [17] and online ensemble model compression [18], and in order to let the compressed sub-models mutually reinforce each other's learning, this paper takes the average of the class outputs of the compressed sub-models (excluding the sub-model whose weights are currently being updated) as the supervision signal, and uses as the loss function the KL divergence between the Softmax probabilities of this averaged output and the Softmax probabilities of the predictions of the sub-model being updated; the specific form can be written as … Inspired by [20], this paper uses the ℓ2,1 norm of the parameter matrix as the sparsity regularizer, while also taking into account that the output channels of a convolutional layer correspond to parameters in both the current layer and the next layer [21]. Thus, the dynamic sparsity regularization [22] adopted in this paper can be written as…”
Citation type: unclassified
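The extracted passage drops the two equations it introduces ("can be written as …"). Purely as a hedged reconstruction of what the prose describes, in my own notation (K sub-models, z_k the logits of the sub-model being updated, σ the softmax, W_l^{(c)} the parameters of output channel c in convolutional layer l), the losses might take a form like:

```latex
% Assumed notation: K sub-models; z_k = logits of the sub-model currently being
% updated; sigma = softmax; W_l^{(c)} = parameters of output channel c in
% convolutional layer l; W_{l+1}^{(:,c)} = the matching input-channel
% parameters of the following layer. Not the cited paper's exact equations.
\begin{align}
  \mathcal{L}_{\mathrm{KD}}^{(k)} &=
    \mathrm{KL}\!\left(
      \sigma\!\Big(\tfrac{1}{K-1}\textstyle\sum_{j \neq k} z_j\Big)
      \,\Big\|\, \sigma(z_k)\right),\\[4pt]
  \mathcal{R}(W) &=
    \sum_{l}\sum_{c=1}^{C_l}
      \sqrt{\big\|W_l^{(c)}\big\|_2^2 + \big\|W_{l+1}^{(:,c)}\big\|_2^2}
    \qquad (\text{group-wise } \ell_{2,1} \text{ sparsity}).
\end{align}
```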