2023
DOI: 10.1145/3568679
Hierarchical Multi-Attention Transfer for Knowledge Distillation

Abstract: Knowledge distillation (KD) is a powerful and widely applicable technique for the compression of deep learning models. The main idea of knowledge distillation is to transfer knowledge from a large teacher model to a small student model, where the attention mechanism has been intensively explored in regard to its great flexibility for managing different teacher-student architectures. However, existing attention-based methods usually transfer similar attention knowledge from the intermediate layers of deep neura…

Cited by 34 publications (7 citation statements)
References 30 publications
“…Knowledge distillation is a model-independent strategy that transfers the knowledge from the pretrained teacher network to guide the training of the student network. Knowledge distillation was originally proposed for model compression [56,57]. By learning the knowledge of the large teacher network, the lightweight student network can achieve results close to or even better than the teacher network [58][59][60].…”
Section: Knowledge Distillation
confidence: 99%
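
To make the teacher-to-student transfer described in this statement concrete, a minimal sketch of the standard soft-target distillation loss (in the style of Hinton et al.) is given below; the PyTorch code, the temperature T, and the weighting alpha are illustrative assumptions, not details taken from the cited works.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soften both output distributions with temperature T and match them via
    # KL divergence, scaled by T^2 so gradients stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```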
“…Guo and his team [56] utilized a Recurrent Neural Network (RNN) for super-resolution data reconstruction. This enhanced the detection capabilities of UAVs and improved detection and positioning algorithms.…”
Section: Related Work
confidence: 99%
“…AT [21] proposed to transfer feature attention knowledge by minimizing the L2 distance of the spatial attention maps between the teacher and student. Furthermore, several recent works [12], [13], [14], [22] also explored the channel knowledge representations and transferred them to the student. HMAT [22] proposed a hierarchical multi-attention transfer framework for knowledge distillation, which utilizes position-based, channel-based, and activation-based attention to transfer knowledge at different levels of deep representations.…”
Section: Related Work (A. Knowledge Distillation)
confidence: 99%
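
A rough sketch of the AT-style spatial attention transfer described above follows, assuming PyTorch feature maps of shape (B, C, H, W) with matching spatial sizes between teacher and student; the power parameter p and the L2 normalization are the commonly used choices rather than details confirmed by the excerpt.

```python
import torch.nn.functional as F

def spatial_attention(feature_map, p=2):
    # Collapse the channel dimension: the sum of |activations|^p yields one HxW
    # attention map per sample, which is flattened and L2-normalized before comparison.
    att = feature_map.abs().pow(p).sum(dim=1)   # (B, H, W)
    return F.normalize(att.flatten(1), dim=1)   # (B, H*W)

def at_loss(student_feat, teacher_feat):
    # L2 distance between the normalized spatial attention maps of teacher and student.
    return (spatial_attention(student_feat) - spatial_attention(teacher_feat)).pow(2).mean()
```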
“…Furthermore, several recent works [12], [13], [14], [22] also explored the channel knowledge representations and transferred them to the student. HMAT [22] proposed a hierarchical multi-attention transfer framework for knowledge distillation, which utilizes position-based, channel-based, and activation-based attention to transfer knowledge at different levels of deep representations. The knowledge can also be defined by the probability distributions of the representations.…”
Section: Related Work (A. Knowledge Distillation)
confidence: 99%
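
For the channel-based attention mentioned in this statement, one plausible formulation (an assumption for illustration, not the exact HMAT definition) averages squared activations over the spatial dimensions and L2-normalizes the resulting per-channel vector; the sketch below also assumes teacher and student share the same channel count.

```python
import torch.nn.functional as F

def channel_attention(feature_map):
    # Collapse the spatial dimensions: the mean squared activation per channel
    # gives a C-dimensional attention vector for each sample.
    att = feature_map.pow(2).mean(dim=(2, 3))   # (B, C)
    return F.normalize(att, dim=1)

def channel_transfer_loss(student_feat, teacher_feat):
    # L2 distance between normalized channel attention vectors
    # (assumes matching channel counts between teacher and student).
    return (channel_attention(student_feat) - channel_attention(teacher_feat)).pow(2).mean()
```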