2020
DOI: 10.1007/978-3-030-58529-7_2

Online Ensemble Model Compression Using Knowledge Distillation

Abstract: This paper presents a novel knowledge distillation based model compression framework consisting of a student ensemble. It enables distillation of simultaneously learnt ensemble knowledge onto each of the compressed student models. Each model learns unique representations from the data distribution due to its distinct architecture. This helps the ensemble generalize better by combining every model's knowledge. The distilled students and ensemble teacher are trained simultaneously without requiring any pretrained…
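A minimal sketch of the training step the abstract describes may help: several compressed students with distinct architectures are trained simultaneously, and each one distills from an ensemble of all the students' predictions, so no pretrained teacher is needed. This assumes PyTorch, a simple logit-averaging ensemble, and a standard cross-entropy plus temperature-scaled KL loss; the paper's exact ensembling rule and loss weighting may differ.

```python
# Hypothetical sketch of the idea in the abstract: several compressed students
# with distinct architectures are trained at the same time, and each student
# distills from the ensemble of all students' predictions, so no pretrained
# teacher is required. The logit-averaging ensemble, temperature T and weight
# alpha are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F

def online_ensemble_distillation_step(students, optimizer, x, y, T=3.0, alpha=0.5):
    logits = [s(x) for s in students]               # one forward pass per student
    ensemble = torch.stack(logits).mean(dim=0)      # ensemble "teacher": mean of student logits

    loss = 0.0
    for z in logits:
        ce = F.cross_entropy(z, y)                  # supervised loss on the labels
        kd = F.kl_div(F.log_softmax(z / T, dim=1),  # distill ensemble knowledge onto this student
                      F.softmax(ensemble.detach() / T, dim=1),
                      reduction="batchmean") * T * T
        loss = loss + (1 - alpha) * ce + alpha * kd

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```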

Cited by 41 publications (40 citation statements)
References 37 publications
“…To demonstrate the use of the distillation metric, we use the results reported in Walawalkar, Shen & Savvides (2020) on the CIFAR100 dataset (Krizhevsky, 2009) with the ResNet44 architecture (He et al., 2016). In their experiment, they trained four student models with relative sizes of 62.84%, 35.36%, 15.25% and 3.74% compared to the teacher model.…”
Section: Distillation Metric
Citation type: mentioning
confidence: 99%
“…Walawalkar, Shen & Savvides (2020) proposed to train an ensemble of models, each broken down into four blocks, where all models share the first block of layers. The first model in the ensemble is considered the teacher (termed the pseudo teacher in the paper).…”
Section: Survey
Citation type: mentioning
confidence: 99%
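As a rough illustration of the architecture this statement describes, one shared first block can feed a pseudo-teacher branch and several student branches. The block contents and sizes below are placeholders, not the blocks used in the cited paper.

```python
# Rough sketch of the architecture described above: each ensemble member is
# split into four blocks, block 1 is shared by every member, and member 0 acts
# as the "pseudo teacher". The layer choices here are placeholders, not the
# blocks used in the cited paper.
import torch.nn as nn

class SharedBlockEnsemble(nn.Module):
    def __init__(self, num_students=3, num_classes=100):
        super().__init__()
        self.shared = nn.Sequential(                 # block 1, shared by all members
            nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())

        def branch():                                # blocks 2-4, private to each member
            return nn.Sequential(
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes))

        self.pseudo_teacher = branch()
        self.students = nn.ModuleList([branch() for _ in range(num_students)])

    def forward(self, x):
        h = self.shared(x)                           # shared features feed every branch
        return self.pseudo_teacher(h), [s(h) for s in self.students]
```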
“…Thus, finding a proper teacher-student architecture in offline distillation is challenging. In contrast, online distillation provides a one-phase end-to-end training scheme via teacher-student collaborative learning on a peer-network architecture instead of a fixed one [25,33,36,32,28,12].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
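To make the offline/online contrast concrete, here is a generic online (mutual) distillation step in the style of deep mutual learning: two peer networks are trained in a single end-to-end phase, each distilling from the other's current predictions. This is an illustrative sketch, not the specific scheme of any one work cited above.

```python
# Generic online (mutual) distillation step in the style of deep mutual
# learning: two peer networks are trained in a single end-to-end phase, each
# distilling from the other's current predictions. Illustrative only; it is
# not the specific scheme of any one work cited above.
import torch.nn.functional as F

def mutual_distillation_step(peer_a, peer_b, opt_a, opt_b, x, y):
    za, zb = peer_a(x), peer_b(x)

    loss_a = F.cross_entropy(za, y) + F.kl_div(      # peer A matches peer B's predictions
        F.log_softmax(za, dim=1), F.softmax(zb.detach(), dim=1),
        reduction="batchmean")
    loss_b = F.cross_entropy(zb, y) + F.kl_div(      # peer B matches peer A's predictions
        F.log_softmax(zb, dim=1), F.softmax(za.detach(), dim=1),
        reduction="batchmean")

    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
    return loss_a.item(), loss_b.item()
```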
“…Inspired by deep mutual learning [17] and online ensemble model compression [18], and in order to let the compressed sub-models mutually reinforce each other's learning, this paper takes the average of the class outputs of the compressed sub-models (excluding the sub-model whose weights are currently being updated) as the supervision signal, and uses as the loss function the KL divergence between the Softmax probabilities of this averaged output and the Softmax probabilities of the predictions of the sub-model being updated; the specific form can be written as … Inspired by [20], this paper uses the ℓ2,1 norm of the parameter matrix as the sparsity regularizer, while also taking into account that the output channels of a convolutional layer correspond to parameters in both the current layer and the next layer [21]. Thus, the dynamic sparsity regularization [22] adopted in this paper can be written as…”
Citation type: unclassified
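The extracted passage drops the two equations it introduces ("can be written as …"). Purely as a hedged reconstruction of what the prose describes, in my own notation (K sub-models, z_k the logits of the sub-model being updated, σ the softmax, W_l^{(c)} the parameters of output channel c in convolutional layer l), the losses might take a form like:

```latex
% Assumed notation: K sub-models; z_k = logits of the sub-model currently being
% updated; sigma = softmax; W_l^{(c)} = parameters of output channel c in
% convolutional layer l; W_{l+1}^{(:,c)} = the matching input-channel
% parameters of the following layer. Not the cited paper's exact equations.
\begin{align}
  \mathcal{L}_{\mathrm{KD}}^{(k)} &=
    \mathrm{KL}\!\left(
      \sigma\!\Big(\tfrac{1}{K-1}\textstyle\sum_{j \neq k} z_j\Big)
      \,\Big\|\, \sigma(z_k)\right),\\[4pt]
  \mathcal{R}(W) &=
    \sum_{l}\sum_{c=1}^{C_l}
      \sqrt{\big\|W_l^{(c)}\big\|_2^2 + \big\|W_{l+1}^{(:,c)}\big\|_2^2}
    \qquad (\text{group-wise } \ell_{2,1} \text{ sparsity}).
\end{align}
```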