2020
DOI: 10.1609/aaai.v34i04.5746

Online Knowledge Distillation with Diverse Peers

Abstract: Distillation is an effective knowledge-transfer technique that uses predicted distributions of a powerful teacher model as soft targets to train a less-parameterized student model. A pre-trained high-capacity teacher, however, is not always available. Recently proposed online variants use the aggregated intermediate predictions of multiple student models as targets to train each student model. Although group-derived targets give a good recipe for teacher-free distillation, group members are homogenized quickly…
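
As a rough illustration of the group-based recipe described in the abstract, the sketch below trains each peer model against the average of the other peers' softened predictions. This is a minimal, generic teacher-free online-distillation step, not the paper's specific attention-based weighting of diverse peers; the peer models, optimizers, temperature T, and mixing weight alpha are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's exact method): each peer in a
# group is trained against the averaged softened predictions of the other peers,
# i.e. teacher-free online distillation with group-derived soft targets.
import torch
import torch.nn.functional as F

def online_distillation_step(peers, optimizers, x, y, T=3.0, alpha=0.5):
    """One training step for a group of peer models on batch (x, y)."""
    logits = [model(x) for model in peers]               # forward pass of every peer
    for i, opt in enumerate(optimizers):
        # Group-derived soft target: mean of the *other* peers' softened outputs.
        with torch.no_grad():
            peer_probs = [F.softmax(z / T, dim=1)
                          for j, z in enumerate(logits) if j != i]
            soft_target = torch.stack(peer_probs).mean(dim=0)
        hard_loss = F.cross_entropy(logits[i], y)         # supervised loss on labels
        soft_loss = F.kl_div(F.log_softmax(logits[i] / T, dim=1),
                             soft_target, reduction="batchmean") * (T * T)
        loss = (1 - alpha) * hard_loss + alpha * soft_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In this baseline every peer receives a very similar target, which is exactly the homogenization issue the abstract points to; the paper's contribution addresses it by keeping the peers diverse.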


Cited by 244 publications (163 citation statements). References 8 publications.
“…However, most of them require the networks to share the same architecture or to perform the same classification task [8][17][18], or the teachers are modeled as different branches of a large network, which is difficult to apply to IoT [5][19][20]. The BAM model [8] can perform "multi→single" distillation tasks, but places limitations on network structures.…”
Section: B. Multi-teacher Knowledge Distillation (mentioning)
Confidence: 99%
“…where $a_i^1$ is the updated weight for $f_i$ and $\gamma$ is the learning rate. Note that the updated logit feature $z_{\mathrm{clo}}$ is related to $\{a_i\}_{i=1}^{M}$, as the gradient computation shown in equation (19) contains $a_i$. Minimizing $\mathcal{L}_{ER}^{clo}$ enables dynamic adaptation of the weights assigned to the fog networks.…”
Section: Brain Storm Training (mentioning)
Confidence: 99%
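
The excerpt above describes a one-step gradient update of the weights $a_i$ that combine the fog networks' logits into $z_{\mathrm{clo}}$. The sketch below is a hedged illustration of that update only: the loss $\mathcal{L}_{ER}^{clo}$ and equation (19) are not given in the excerpt, so the loss is passed in as a generic callable, and the function name and learning rate are placeholders.

```python
# Hedged sketch of the weighted-logit update quoted above: the aggregated logit
# z_clo is a weighted sum of the fog networks' logits, and the weights a_i are
# refined by one gradient step with learning rate gamma. The loss L_ER^clo is
# not defined in the excerpt, so it is supplied as a generic callable here.
import torch

def refine_aggregation_weights(logits, a, loss_fn, gamma=0.1):
    """logits: list of per-network logit tensors z_i; a: 1-D weight tensor over networks."""
    logits = [z.detach() for z in logits]              # only the weights are updated here
    a = a.detach().clone().requires_grad_(True)
    z_clo = sum(w * z for w, z in zip(a, logits))      # z_clo = sum_i a_i * z_i
    loss = loss_fn(z_clo)                              # stand-in for the excerpt's L_ER^clo
    loss.backward()                                    # d loss / d a_i flows through z_clo
    with torch.no_grad():
        a_updated = a - gamma * a.grad                 # a_i^1 = a_i - gamma * dL/da_i
    return a_updated.detach()
```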