2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01165

Decoupled Knowledge Distillation

Abstract: Distilling knowledge from convolutional neural networks (CNNs) is a double-edged sword for vision transformers (ViTs). It boosts performance, since the image-friendly local inductive bias of CNNs helps ViTs learn faster and better, but it also leads to two problems: (1) The network designs of CNNs and ViTs are completely different, so their intermediate features sit at different semantic levels, making spatial-wise knowledge transfer methods (e.g., feature mimicking) inefficient. (2) Distilling knowledge from a CNN limits…
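The abstract refers to feature mimicking as a spatial-wise knowledge transfer method without spelling it out. Below is a minimal, illustrative sketch of one common formulation, in which a 1x1 projection aligns the student's channels with the teacher's and the two feature maps are matched location-by-location with an MSE loss; the class name, the projection, and the interpolation step are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMimicking(nn.Module):
    """Spatial-wise feature mimicking (illustrative sketch): project the
    student's intermediate feature map to the teacher's channel dimension
    and match the two maps with a pointwise MSE loss."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 projection to align channel dimensions (a common choice, assumed here)
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, feat_s, feat_t):
        feat_s = self.proj(feat_s)
        # Resize if the spatial resolutions differ (e.g., CNN stride vs. ViT token grid)
        if feat_s.shape[-2:] != feat_t.shape[-2:]:
            feat_s = F.interpolate(feat_s, size=feat_t.shape[-2:],
                                   mode="bilinear", align_corners=False)
        return F.mse_loss(feat_s, feat_t)
```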

Cited by 434 publications (101 citation statements)
References 50 publications
“…2(b) can also be achieved by the metric loss. To further analyze the role of positive and negative pairs, we decouple the KL divergence into positive and negative pair distillation as proposed by DKD [77], showing that positive pair distillation leads to performance degradation (see Table 3). DwoPP: Distillation without Positive Pairs.…”
Section: Distillation Without Positive Pairs (DwoPP) | mentioning
confidence: 99%
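The decoupling referenced in this excerpt follows DKD, which splits the softened KL divergence into a target-class ("positive") term and a non-target-class ("negative") term. The following is a minimal PyTorch-style sketch of that split; the default values of alpha, beta, and the temperature T are illustrative, not taken from the excerpt.

```python
import torch
import torch.nn.functional as F

def dkd_loss(logits_s, logits_t, target, alpha=1.0, beta=8.0, T=4.0):
    # Decoupled Knowledge Distillation sketch: split the KL divergence into
    # a target-class (positive) term and a non-target-class (negative) term.
    # target: LongTensor of ground-truth class indices.
    p_s = F.softmax(logits_s / T, dim=1)
    p_t = F.softmax(logits_t / T, dim=1)
    mask = F.one_hot(target, num_classes=logits_s.size(1)).bool()

    # Binary (target vs. rest) distributions -> target-class term (TCKD)
    pt_s = torch.stack([p_s[mask], 1.0 - p_s[mask]], dim=1)
    pt_t = torch.stack([p_t[mask], 1.0 - p_t[mask]], dim=1)
    tckd = F.kl_div(pt_s.log(), pt_t, reduction="batchmean") * (T ** 2)

    # Distribution over non-target classes only -> non-target-class term (NCKD)
    logits_s_nt = logits_s.masked_fill(mask, -1e9)
    logits_t_nt = logits_t.masked_fill(mask, -1e9)
    nckd = F.kl_div(
        F.log_softmax(logits_s_nt / T, dim=1),
        F.softmax(logits_t_nt / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)

    return alpha * tckd + beta * nckd
```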
“…Table 3: Decoupling Eq. 6 into PPKD and NPKD with coefficients α and β on Market-1501 with temperature T = 1.0. ρ is the positive probabilities as in DKD [77].…”
Section: Comparative Performance Evaluation | mentioning
confidence: 99%
“…[13] first propose the concept of knowledge distillation, where the student mimics the soft predictions of the teacher. Knowledge distillation has been utilized in various fields, including classification [47] [29] [1], object detection [37] [46], and semantic segmentation [36] [22]. According to the objective of mimicking, knowledge distillation can be divided into three categories: response-based [47], feature-based [12] [38], and relation-based [40] [41], which distill logits, intermediate activations, and the relations between features in different layers, respectively.…”
Section: Knowledge Distillation | mentioning
confidence: 99%
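For reference, a minimal sketch of the response-based category described in this excerpt, in which the student mimics the teacher's softened class predictions via a KL divergence on logits; the temperature value is an illustrative assumption, not specified by the excerpt.

```python
import torch
import torch.nn.functional as F

def response_kd_loss(logits_s, logits_t, T=4.0):
    # Response-based distillation: match the student's softened predictions
    # to the teacher's softened predictions with a KL divergence.
    return F.kl_div(
        F.log_softmax(logits_s / T, dim=1),
        F.softmax(logits_t / T, dim=1),
        reduction="batchmean",
    ) * (T ** 2)
```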
“…Deep neural networks are widely used in various computer vision tasks [1]-[4] and have achieved remarkable results [5], [6]. However, the current state-of-the-art deep models suffer from huge energy consumption and high operating and storage costs, which greatly hinder their deployment in resource-constrained situations [7]-[9].…”
Section: Introduction | mentioning
confidence: 99%