2022
DOI: 10.48550/arxiv.2207.12106
Preprint

Black-box Few-shot Knowledge Distillation

Abstract: Knowledge distillation (KD) is an efficient approach to transfer the knowledge from a large "teacher" network to a smaller "student" network. Traditional KD methods require lots of labeled training samples and a white-box teacher (parameters are accessible) to train a good student. However, these resources are not always available in real-world applications. The distillation process often happens at an external party side where we do not have access to much data, and the teacher does not disclose its parameter…
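
For context, the "traditional KD" the abstract contrasts against trains the student to match the teacher's softened output distribution while also fitting ground-truth labels. Below is a minimal PyTorch sketch of that classic objective; the `kd_loss` helper, temperature `T`, and weight `alpha` are illustrative choices, not taken from the paper, which instead targets the black-box, few-shot setting where teacher logits for large labeled sets are unavailable.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic soft-target distillation loss (Hinton-style), for illustration only.

    student_logits, teacher_logits: (batch, num_classes) tensors
    labels: (batch,) ground-truth class indices
    T: softmax temperature; alpha: weight on the distillation term
    """
    # Soften both distributions with temperature T and match them via KL divergence.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Standard cross-entropy on whatever labeled samples are available.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * distill + (1.0 - alpha) * ce
```

Note that this objective assumes access to the teacher's logits and a reasonable amount of labeled data, which is exactly what the black-box few-shot setting of the paper does away with.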

Cited by 1 publication (1 citation statement)
References: 25 publications

“…These pruning methods usually require a large amount of labeled data, and the training process is very time-consuming [9], [17]. The knowledge distillation method transfers knowledge from the pre-trained teacher network to the student network, and trains the student network by making students imitate the output of the teacher network, so as to achieve the performance of the teacher network [18]- [20]. However, since the student network is usually set to be randomly initialized, it needs to rely on a large amount of data for knowledge transfer to train a model with good performance [9], [21].…”
Section: Introduction
Confidence: 99%