Decoupled Knowledge Distillation

Zhao, Borui; Cui, Quan; Song, Renjie; Qiu, Yixuan; Liang, Jiajun

doi:10.1109/cvpr52688.2022.01165

Cited by 434 publications

(101 citation statements)

References 50 publications

“…2(b) can also be achieved by the metric loss. To further analyze the role of positive and negative pairs, we decouple the KL divergence into positive and negative pair distillation as proposed by DKD [77], showing that positive pair distillation leads to performance degradation (see Table 3). DwoPP: Distillation without Positive Pairs.…”

Section: Distillation Without Positive Pairs (DwoPP); type: mentioning

confidence: 99%

“…Table 3: Decoupling Eq. 6 into PPKD and NPKD with coefficients α and β on Market-1501 with temperature T = 1.0. ρ is the positive probabilities as in DKD [77].…”

Section: Comparative Performance Evaluation; type: mentioning

confidence: 99%

“…Influence of positive pairs on distillation. To better understand the role of positive pairs (PP) and negative pairs (NP) in knowledge distillation, we decouple the knowledge distillation (following DKD [77]) from Eq. 6 into PPKD and NPKD by $L_{\mathrm{DwPP}^*} = \alpha \cdot \mathrm{PPKD} + \beta \cdot \mathrm{NPKD}$, with $\alpha + \beta = 1.0$ (here we use $T = 1.0$).…”

Section: Comparative Performance Evaluation; type: mentioning

confidence: 99%
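The weighted split quoted above builds on the DKD decomposition, which can be sketched in plain Python. This is a minimal illustration under stated assumptions: it implements the class-level DKD split of the KL divergence into a target-class term (TCKD) and a non-target-class term (NCKD), not the pair-level PPKD/NPKD losses of the citing paper, whose exact definitions are not given here. The function names (`decoupled_terms`, `weighted_loss`) are hypothetical, and T = 1.0 is used as in the quoted statement.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def decoupled_terms(teacher_logits, student_logits, target, temperature=1.0):
    """Return (TCKD, NCKD, teacher target probability).

    TCKD: KL between the binary (target, non-target) distributions.
    NCKD: KL between the distributions renormalised over non-target classes.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)

    rest_t = sum(p for i, p in enumerate(p_t) if i != target)
    rest_s = sum(p for i, p in enumerate(p_s) if i != target)

    tckd = kl_divergence([p_t[target], rest_t], [p_s[target], rest_s])

    non_target_t = [p / rest_t for i, p in enumerate(p_t) if i != target]
    non_target_s = [p / rest_s for i, p in enumerate(p_s) if i != target]
    nckd = kl_divergence(non_target_t, non_target_s)

    return tckd, nckd, p_t[target]


def weighted_loss(teacher_logits, student_logits, target, alpha, beta,
                  temperature=1.0):
    """Hypothetical helper: alpha * TCKD + beta * NCKD with free weights."""
    tckd, nckd, _ = decoupled_terms(
        teacher_logits, student_logits, target, temperature
    )
    return alpha * tckd + beta * nckd
```

With these definitions, the ordinary KL distillation loss equals TCKD + (1 - p_target) * NCKD, where p_target is the teacher's probability for the target class. Replacing that fixed coupling with independent weights α and β is what makes the two terms separately tunable.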

Positive Pair Distillation Considered Harmful: Continual Meta Metric Learning for Lifelong Object Re-Identification

Wang, Wu, Bagdanov et al. 2022. Preprint.

Lifelong object re-identification incrementally learns from a stream of re-identification tasks. The objective is to learn a representation that can be applied to all tasks and that generalizes to previously unseen re-identification tasks. The main challenge is that at inference time the representation must generalize to previously unseen identities. To address this problem, we apply continual meta metric learning to lifelong object re-identification. To prevent forgetting of previous tasks, we use knowledge distillation and explore the roles of positive and negative pairs. Based on our observation that the distillation and metric losses are antagonistic, we propose to remove positive pairs from distillation to robustify model updates. Our method, called Distillation without Positive Pairs (DwoPP), is evaluated on extensive intra-domain experiments on person and vehicle re-identification datasets, as well as inter-domain experiments on the LReID benchmark. Our experiments demonstrate that DwoPP significantly outperforms the state-of-the-art.

Introduction. Object re-identification (ReID) aims to associate the identity of a query image with those in a gallery set [18,75]. It is applied to many applications, including person re-identification [5, …

“…[13] first propose the concept of knowledge distillation, where the student mimics the soft predictions from teacher. Knowledge distillation has been utilized in various fields including classification [47] [29] [1], object detection [37] [46], semantic segmentation [36] [22]. According to the objective of mimicking, knowledge distillation can be divided into three categories: response-based [47], feature-based [12] [38] and relation-based [40] [41], which distill with logits, intermediate activations and the relation of features in different layers respectively.…”

Section: Knowledge Distillation; type: mentioning

confidence: 99%
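The response-based category described in the statement above, where the student mimics the teacher's soft predictions, can be sketched as a temperature-softened KL loss on logits. This is a minimal illustration in plain Python, not code from any of the cited papers. The function name `response_kd_loss` and the default temperature of 4.0 are assumptions chosen for the example. The T² factor follows the common practice of rescaling the soft-target loss so that its gradient magnitude stays comparable across temperatures.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def response_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """T^2 * KL(teacher || student) on temperature-softened predictions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return temperature ** 2 * kl
```

Feature-based and relation-based distillation would replace the logits here with intermediate activations or pairwise feature relations, which need model-specific code and are not sketched.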

Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling

Wang, Li, Wen et al. 2022. Preprint.

DETR is a novel end-to-end transformer-based object detector, which significantly outperforms classic detectors when scaling up the model size. In this paper, we focus on the compression of DETR with knowledge distillation. While knowledge distillation has been well-studied in classic detectors, there is little research on how to make it work effectively on DETR. We first provide experimental and theoretical analysis to point out that the main challenge in DETR distillation is the lack of consistent distillation points. Distillation points refer to the corresponding inputs of the predictions for the student to mimic, and reliable distillation requires sufficient distillation points which are consistent between teacher and student. Based on this observation, we propose a general knowledge distillation paradigm for DETR (KD-DETR) with consistent distillation points sampling. Specifically, we decouple detection and distillation tasks by introducing a set of specialized object queries to construct distillation points. In this paradigm, we further propose a general-to-specific distillation points sampling strategy to explore the extensibility of KD-DETR. Extensive experiments on different DETR architectures with various scales of backbones and transformer layers validate the effectiveness and generalization of KD-DETR. KD-DETR boosts the performance of DAB-DETR with ResNet-18 and ResNet-50 backbones to 41.4% and 45.7% mAP, respectively, which are 5.2% and 3.5% higher than the baseline, and ResNet-50 even surpasses the teacher model by 2.2%.

“…Deep neural networks are widely used in various computer vision tasks [1]-[4] and have achieved remarkable results [5], [6]. However, the current state-of-the-art deep models suffer from huge energy consumption, high operating and storage costs, which greatly hinder their deployment in resource-efficient situations [7]-[9].…”

Section: Introduction; type: mentioning

confidence: 99%

Progressive Network Grafting With Local Features Embedding for Few-Shot Knowledge Distillation

2022. IEEE Access.

Compared with traditional knowledge distillation, which relies on a large amount of data, few-shot knowledge distillation can distill student networks with good performance using only a small number of samples. Some recent studies treat the network as a combination of a series of network blocks, adopt a progressive graft strategy, and use the output of the teacher network to distill the student network. However, this strategy ignores the importance of the local feature information generated by the teacher block, which indicates what features should be learned by the corresponding student block. In this paper, we argue that using the features output from the teacher block can guide the student block to learn more useful information from the teacher block. Therefore, we propose a joint learning framework for few-shot knowledge distillation that exploits both the output of the teacher network and the local features generated by the teacher block to optimize the student network. The local features will guide the student block to learn the output of the teacher block, and the output of the teacher network will allow the student network to use its learned local features to better contribute to the classification. In addition, further model compression was carried out to design a series of student networks with fewer parameters by reducing the number of network channels. Finally, extensive experiments on the CIFAR10 and CIFAR100 datasets show that our method outperforms SOTA, and that it has considerable advantages even with a very small number of parameters in further model compression experiments.
